Preface



The first edition having sold out, I have a welcome opportunity to augment
this volume with some recent applications of speech research. A new chapter,
by Holger Quast, treats speech dialogue systems and natural language
processing.

Dictation programs for word processors, voice dialing for mobile phones, 
and dialogue systems for air travel reservations, automated banking, and 
translation over the telephone are at the forefront of human-machine inter- 
faces. Spoken language dialogue systems are also invaluable for the physically 
handicapped. 

For researchers new to the field, the new chapter (pp. 67-106) provides
an overview of fundamental linguistic concepts from phonetics, morphology,
syntax, semantics and pragmatics, grammars and knowledge representation.
Symbolic methodology, such as Noam Chomsky’s traditional hierarchy of
formal languages, is laid out, as are statistical approaches to analyzing
text. Proven tools of language processing are covered in detail, including
finite-state automata, Zipf’s law, trees and parsers. The second part of the
new chapter introduces the building blocks of state-of-the-art dialogue
systems.

Not surprisingly, a lot remains to be done in linguistics, as illustrated by 
a recent automatic translation of my talk on the history of nuclear fission 
(http://www.physik3.gwdg.de/~mrs/) in which its discoverer, Otto Hahn,
came out as a petrol cock. The program obviously didn’t recognize Otto 
Hahn as a name and rendered the German Hahn as cock in English. But why 
petrol? Here the program was too clever by half. A gasoline or petrol engine 
is called Otto-Motor in German. Thus Otto Hahn = petrol cock. 



Berkeley Heights and Göttingen, April 2004



Manfred Schroeder 




Preface to the First Edition 



World economies are increasingly driven by research, knowledge, information, 
and technology - with no small part being played by computer speech: 

- automatic speech recognition and speaker authentication, 

- speech compression for mobile phones, voice security and 
Internet “real audio,” 

- speech synthesis and computer dialogue systems. 

According to the London Economist, information technology represents “a
change even more far-reaching than the harnessing of electrical power a
century ago . . . Where once greater distance made communications progressively
more expensive and complicated, now distance is increasingly irrelevant.”^
Natural speech, predominant in human communication since prehistoric 
times, has acquired a brash new kin: synthetic or computer speech. While 
people continue to use speech liberally, many of us are increasingly exposed to 
speech manufactured and understood by computers, compressed or otherwise 
modified by digital signal processors or microchips. This evolution, driven by 
innumerable technical advances, has been described in a profusion of scientific 
papers and research monographs aimed at the professional specialist. The 
present volume, by contrast, is aimed at a larger audience: all those generally 
curious about computer speech and all it portends for our future at home 
and in the workplace. I have therefore kept the technical details in the first 
chapters to a minimum and relegated any unavoidable “mathematics” to the 
end of the book. 

Considering the importance of hearing for speech compression, speech 
synthesis (and even recognition), I have included a brief overview of hearing, 
both monaural and binaural. 

I have also added a compact review of signal analysis because of its rele- 
vance for speech processing. 

For the benefit of readers new to speech and computers I have added
a glossary of terms from these fields. I have also augmented the numbered
references by a list of books for general reading, selected journals, and major
meetings on speech.

Progress in any field does not arise in a vacuum but happens in a manifestly
human environment. But many scientists, Carl Friedrich Gauss, the
“Prince of Mathematicians,” foremost among them, take great pride in
obliterating in their publications any trace of how success was achieved. From
this view, with all due respect for the great Gauss, I beg to differ. Of course,
in mathematics, the “conjecture - lemma - theorem - proof” cycle will always
remain the mainstay of progress. But I believe that much is to be gained
- and not only for the nonspecialist - by putting scientific advances in a
personal context.

^Frances Cairncross, The Economist, 13 September 1997. Quoted from The New
York Review of Books, 26 March 1998 (p. 29). But see also Thomas K. Landauer,
The Trouble with Computers (MIT Press, Cambridge, MA 1995).

Some Personal Recollections 

This book has its genesis in a visit by a Bonn linguist to the University of
Göttingen in Germany shortly before Christmas 1951. The noted speech
scientist had been invited to give a talk at the General Physics Colloquium,
not on the usual nuts-and-bolts exploits in physics, but on Shannon’s
communication theory and its potential importance for human language,
written or spoken.

After the distinguished guest had safely departed, most of the physicists 
present professed “not to have understood a word.” During the ensuing de- 
partmental Christmas party the chairman, the reigning theoretical physi- 
cist, asked me how I had liked the lecture. When I admitted that (although 
far from having understood everything) I was deeply impressed, the profes- 
sor’s answer was a disapproving stare. How could anything be interesting, he 
seemed to be saying, that did not partake of Planck’s quantum of energy or 
general relativistic covariance. 

I was working on concert hall acoustics at the time, using microwave 
cavities as a convenient model for acoustic enclosures. I was soon astounded 
by the chaotic distribution of the resonances that my measurements revealed
- even for relatively small deviations from perfect geometric symmetry of the
cavity: I had stumbled on the very same probability distribution that also 
governs the energy levels of complex atomic nuclei and that Eugene Wigner in 
Princeton had already promulgated in 1935. But at that time and for decades 
thereafter few (if any) physicists appreciated its general import. 

Now, 60 years later, the Wigner distribution is recognized as a universal 
tell-tale sign of “non-integrable” dynamical systems with chaotic behavior. 
But, of course, chaos hadn’t been “invented” yet and I for one didn’t know 
enough nuclear physics - too much like chemistry I thought - to see the 
connection. Nor, I think it is safe to say, did atomic physicists know enough 
about concert halls or microwave cavities to appreciate the common thread. 
Interestingly, the Wigner distribution also raises its head in - of all things -
number theory, where it describes the distribution of the (interesting) zeros
of the Riemann zeta-function.

Be that as it may, my thesis advisor was much impressed by my progress 
and suggested that I go to America, the “microwave country.” The thought 
had crossed my mind and it did sound interesting. So I applied for a Fulbright 
Fellowship - and failed flat out. My academic grades were good, my English 
was passable but, the Fulbright commission concluded, I was not politically 
active enough. They were looking for foreign students who, upon their return 
from the States, could be expected to “spread the gospel” about democracy, 
the American Way of Life and so forth. But I, it seemed, was mostly interested 
in physics, mathematics, languages, photography, tennis and - perish the 
thought - dancing. 

During my thesis on the statistical interference of normal modes in microwave 
cavities and concert halls, I discovered the work of S. O. Rice on random 
noise, which explained a lot about sound transmission in concert halls.^

Now Rice, I knew, was with Bell and so was Shannon. So after my Ful- 
bright failure I went to my professor and asked him for a recommendation 
for Bell Laboratories. But he told me that Bell didn’t take any foreigners. In 
fact, in 1938 he had backed one of his better students (who was eager to leave 
Germany) but Bell declined. (I still wonder what was behind that rejection.) 

Then, in early 1954, James Fisk (later president of Bell Laboratories) 
and William Shockley (of transistor fame) traveled to Germany (Fisk had 
studied in Heidelberg) to look for young talent for Bell. When I heard about 
this, I immediately went again to my professor telling him that Bell was 
not only accepting foreigners but they were actively looking for them. This 
time around the kind professor promised a recommendation and just three 
weeks later I received an invitation from Bell for an employment interview in 
London. 



^It transpired that such transmission functions were basically random noise,
albeit in the frequency domain, and that room acousticians, for years, had measured
nothing but different samples of the same random noise in their quest for a formula
to pinpoint acoustic quality. Within a limited cohort of acousticians I became quite
well-known for this work although it was just an application of Rice’s theory. But
it did establish a new area of research in acoustics (and later microwaves and
coherent optics): random wave fields, with many interesting applications: acoustic
feedback stability of hands-free telephones and public address systems, fading in
mobile communications and laser speckle statistics.



An Unusual Interview

The venue of the interview was the lobby of the Dorchester Hotel. After
explicating my thesis work, I asked the recruiter to tell me a bit about the
Bell System. “Well, there was AT&T, the parent company, Western Electric,
the manufacturing arm, Bell Laboratories, and 23 operating companies: New
York Telephone, New Jersey Bell, Southern Bell . . . ” when, in the middle
of this recitation, he stopped short and, with his eyes, followed an elegant
young lady (an incognito countess!) traversing the long lobby. After a minute
or two, without losing a beat, he continued “yes, and there is Southwestern
Bell, Pacific Telephone, . . . ”^

Everything went well during the interview, and I soon received an offer of 
employment. The monthly salary was $640 - five times as much as a young 
Ph.D. could have made at Siemens (500 Marks). But, of course, at Bell I 
would have worked for nothing. 

On September 30, 1954, arriving in New York, I stepped off the Andrea 
Doria (still afloat then) and into a chauffeured limousine which took me 
and my future director and supervisor to one of the best restaurants in the 
area. I couldn’t read the menu (mostly in French) - except for the word 
Bratwurst. So, with champagne corks popping and sparkling desserts going off
at neighboring tables, everybody in our party had Bratwurst and Löwenbräu
beer (my future bosses were obviously very polite people) - I wonder how
many immigrants were ever received in such style.

Arriving at the Murray Hill Labs, I was put on the payroll and given a 
dollar bill as compensation for all my future inventions. (When I retired 33 
years later I had garnered 45 U.S. patents and innumerable foreign filings, 
but the fact that I had earned less than 3 cents for every invention never 
bothered me.) 

Once securely ensconced at Murray Hill, I was encouraged to continue my 
work on random wave fields but - in typical youthful hubris - I thought I 
had solved all relevant problems and I elected to delve into speech. Speech, 
after all, meant language - always a love of mine - and possibly relevant to 
the telephone business. 



^Twenty-five years later, to the day, on April 25, 1979, I went back to the
Dorchester. At the far end of the lobby there was a kind of hat-check counter
with an elderly lady behind it. I went up to her and asked, “Could it be that on
this day, 25 years ago, on April 25, 1954, a Sunday, a young woman might have
appeared from the door behind you - it was about 2 p.m. - crossed the lobby and
then exited by the revolving door?” She must have thought I was from Scotland
Yard or something. But, unflustered, she answered “Oh yes, of course, at 2 p.m.
we had a change of shifts then. This entrance was for service personnel -
chambermaids and so forth.” I asked her, “How do you know this?” And she said,
“I have been here for 30 years.”



The Bachelor and the Hoboken Police

William H. Doherty, my first Executive Director at Bell, introduced me to
one and all as “Dr. Schroeder, who just joined us from Germany - and he
is a bachelor.” I was 28 then but apparently already getting a bit old for
a single Catholic male. (I later had to disappoint Bill by marrying in an
interdenominational ceremony - at the Chapel of the Riverside Church on
the Hudson - my wife being Orthodox.)

When one night, carrying a camera with a very long lens, I was arrested by
dockworkers on the Hoboken piers (who believed they had caught a foreign spy),
he vouched for me, explaining that I was just taking pictures of the full moon
rising over the New York skyline across the Hudson. (The dockworkers had called
the Hoboken police, who, however, were completely sidetracked by my
collection of photographs of beautiful models. The dock hands were apparently
not satisfied with this result of their “citizen’s arrest” and the incident was
raised again years later during my naturalization proceedings when Doherty,
now assistant to the President, rescued me once more.)



Göttingen and Berkeley Heights, February 1999



Manfred Schroeder



Acknowledgments



In my career in speech I am indebted to many people, not least the “noted
linguist” noted above: the late Werner Meyer-Eppler, who sparked my interest
in Shannon’s communication theory and linguistics.

Professor Erwin Meyer was the professor who rescued me from the 
clutches of German industry with a scholarship for my Ph.D. And it was 
he who wrote the recommendation that propelled me first to London and 
then on to New York and Murray Hill. 

Winston Kock was the polite director who had received me in such style 
in New York. 



Bell People 

Starting out in a new field is never easy but I learned a lot - much by 
osmosis - from the speech pioneers at Bell: Ralph Miller, my first supervisor, 
Harold Barney of formant-frequency fame, Homer Dudley, the inventor of the 
vocoder, Hugh Dunn, who had built an early electrical model of the vocal 
tract, Warren Tyrrel, inventor of the microwave Magic Tee, and the kindly 
Floyd Harvey, who was assigned to “teach me the ropes.” 

Within a year or two after my arrival at Murray Hill a crew of contemporaries
appeared on the speech scene at Bell: the Swiss physicist Erich Weibel,
the mathematician Henry Kramer (originally from Düsseldorf), Hank McDonald,
the instrumentation genius, and Max Mathews, who taught us all
that working with analog circuits was basically solving mathematical
equations and that this could be done much better on a “digital computer” - an
idea that I took up with a vengeance, soon simulating sound transmission in
concert halls with computer running times a thousand times real time.^

^When the first artificially reverberated music emerged from the
digital-to-analog converter, the computer people thought that this was a prank,
that I had hidden a tape recorder behind their racks.

I must make special mention here of Ben Logan (“Tex” to the country
music community) with whom I worked very closely for many years and who
taught me electrical engineering. I will never forget when, reading an article
bemoaning the poor sound quality of artificial reverberators, I turned to Ben
(literally - we were sharing an office) asking him whether there were allpass
filters with an exponentially decaying impulse response. Ben’s answer was
yes and the allpass reverberator was born.
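
The property Ben confirmed can be checked numerically. The sketch below assumes the standard textbook form of a delay-line allpass section, y[n] = -g x[n] + x[n-D] + g y[n-D]; the text itself gives no formula, and the delay D = 5 and gain g = 0.7 are purely illustrative values. The impulse response decays geometrically, by a factor g every D samples, while its total energy stays at one, as it must for a filter that passes all frequencies with unit gain.

```python
def allpass(signal, delay, gain):
    """Delay-line allpass section: y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    output = []
    for n, x_now in enumerate(signal):
        # Samples before the start of the signal are taken as zero.
        x_delayed = signal[n - delay] if n >= delay else 0.0
        y_delayed = output[n - delay] if n >= delay else 0.0
        output.append(-gain * x_now + x_delayed + gain * y_delayed)
    return output


# Illustrative (assumed) parameters: delay of 5 samples, feedback gain 0.7.
impulse = [1.0] + [0.0] * 199
response = allpass(impulse, delay=5, gain=0.7)

# Non-zero taps sit at multiples of the delay and shrink by the gain each time.
taps = [response[k] for k in range(0, 30, 5)]
energy = sum(v * v for v in response)
```

Here `taps` starts with -g, then 1 - g², then g(1 - g²), and so on, and `energy` comes out very close to 1.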

At about the same time, shortly after the Hungarian revolution - on 
December 31, 1956, to be exact - Bela Julesz, arriving through Camp Kilmer,
New Jersey, knocked on our door and was immediately admitted. Bela became 
a lifelong friend and limitless source of intellectual stimulation. 

Later the acoustics department was merged with the “vision” people,
which brought me into contact with the psychologist George Sperling, Leon
Harmon (“Einstein’s technical aide” and jack-of-many-trades), and the
superlative John L. Kelly, Jr., who had given Shannon’s concept of information
rate a new meaning (in a gambling context). Kelly, with Lou Gerstman, digitally
simulated the first terminal-analog speech synthesizer.

Of the people I encountered at Bell perhaps the most noteworthy was Ed
David, who had the inspiration to persuade our immediate superiors, Vice
President for Research William O. Baker and John R. Pierce, of satellite
communication fame, to establish, at Bell Laboratories, a broadly based
research group in human communication - complementing the usual engineers,
physicists and mathematicians with psychologists, physiologists and linguists.
It is in this environment that I got to know many outstanding people from
a broad range of scientific disciplines: Peter Denes, the linguist from
England, who introduced online computing in speech research, Gerald Harris
from nuclear physics (who showed that the threshold of hearing was just
above the molecular Brownian noise in the inner ear), the psychologists
Newman Guttman, Thomas Landauer, and Lynn Streeter, and the physiologists
Willem van Bergeijk, Larry Frischkopf, Bob Capranica, and Åke Flock from
Sweden.

Ed David also inaugurated a visiting scholar program with MIT and Harvard,
which brought to Murray Hill such luminaries as the neural-network
pioneers and authors of “A Logical Calculus of the Ideas Immanent in Nervous
Activity,” Walter Pitts (typically delivering his lectures from a crouching
position) and Warren McCulloch; the noted linguists Roman Jakobson, Morris
Halle, Gunnar Fant, Ken Stevens, and Arthur House; also Walter Rosenblith and
the inimitable Jerry Lettvin (the latter with a live demonstration of “what the
frog’s eye tells the frog’s brain”).

On the social side, Ed and his charming wife Ann acted as my chaperones 
(sometimes too stringently, I felt) in converting me from a peccable European 
to a good American. (Ed also showed me how to boil lobster and the secrets 
of properly grilling steak.) And Ann, hailing from Atlanta, taught all of us 
newcomers the essence of a gracious social life. 

In 1956, Ed was my best man when I married Anny Menschik in New York. 
Later he and Newman Guttman acted as sponsors for my U.S. citizenship. 
At the Labs I also profited much from my close associations with the resident
mathematicians: the critical but always helpful David Slepian; Ron
Graham; Dick Hamming, who taught me how totally counterintuitive
higher-dimensional spaces are; Jessie MacWilliams, who introduced me to
primitive polynomials; Ed Gilbert, who solved an important integral equation on
reverberation for me; Aaron Wyner; Neil Sloane, who worked with me on a
data compression scheme involving new permutation codes; Andrew Odlyzko;
Henry Landau; Leopold Flatto; Hans Witsenhausen; Larry Shepp; and Jeff
Lagarias, who infused dynamical systems (chaos and fractals) with number
theory.

Elwyn Berlekamp, an early friend, supplied the mathematical underpin- 
nings for my proposal to use Hadamard smearing in frame difference picture 
coding. 

Ingrid Daubechies, now at Princeton, initiated me into the world of 
wavelets and redressed my Flemish at the Bell Labs Dutch Table. 

I also knew, albeit in a more cursory manner, the “old guard”: Hendrik
Bode and Sergei Schelkunoff, with whom I shared an interest in microwaves.
The fatherly Harry Nyquist and I worked together for a while on underwater
sound problems. Claude Shannon once came to my office with a funny little
trumpet wanting to know whether I could compute its resonances. David
Hagelbarger introduced me to several of his “machinations,” including
Shannon’s outguessing machine.

The Bell mathematician with whom I had the most frequent contacts for 
over 30 years was John Wilder Tukey, one of the sharpest minds around, who 
split his time between Bell and the Princeton Statistics Department. Tukey, 
who seemed to thrive on half a dozen glasses of skim milk for lunch, was the 
first human parallel processor I have known: during staff meetings he would 
regularly work out some statistical problem while hardly ever missing a word 
said during the discussion. He is of course best known for his (re)invention,
with IBM’s Jim Cooley, of the fast Fourier transform, which changed the
topography of digital signal processing (never mind that Gauss had the FFT
150 years earlier). Tukey was also a great wordsmith: he coined the terms
bit and cepstrum (the Fourier transform of the logarithm of the Fourier
transform). But some of his kookier coinages, like quefrency (for cepstral
frequency) and saphe (for cepstral phase), didn’t catch on.
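
The one-line definition of the cepstrum can be tried out in a few lines. The sketch below makes two assumptions that go beyond the text: it follows the common convention of taking the logarithm of the spectral magnitude, and it uses the inverse transform for the second step (for a real, even log spectrum this differs from a forward transform only by a scale factor). The test signal, a unit pulse followed by an echo of strength 0.5 eight samples later, is purely illustrative. A plain, slow DFT is used so that nothing beyond the standard library is needed.

```python
import cmath
import math


def dft(values, inverse=False):
    """Plain O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = sum(values[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        result.append(acc / n if inverse else acc)
    return result


def real_cepstrum(signal):
    """Inverse transform of the log-magnitude spectrum of the signal."""
    log_magnitude = [math.log(abs(v)) for v in dft(signal)]
    return [v.real for v in dft(log_magnitude, inverse=True)]


# Illustrative (assumed) test signal: a pulse plus an echo 8 samples later.
n_points, echo_delay, echo_gain = 64, 8, 0.5
signal = [0.0] * n_points
signal[0] = 1.0
signal[echo_delay] = echo_gain

cepstrum = real_cepstrum(signal)
# The largest cepstral value away from quefrency zero marks the echo delay.
peak_quefrency = max(range(1, n_points // 2), key=lambda q: cepstrum[q])
```

In this example `peak_quefrency` equals the echo delay, which illustrates why the cepstrum is handy for detecting echoes and other periodicities in a spectrum.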

Henry Pollak, who directed mathematics research during my tenure as
director of acoustics and speech research, was an unstinting source of
mathematical talent for our down-to-earth problems in speech and hearing. (But
his devotion to administrative detail never rubbed off on my more nonchalant
style of management.)

Of all the people I worked with at Bell only one was a bona-fide speech
researcher before joining us: Jim Flanagan, whose thesis at MIT was concerned
with the measurement and perception of formant frequencies. Jim served as
one of the department heads who reported to me during one of the most
fertile periods in speech research at Bell, an epoch to which Jim contributed
mightily.

Three other department heads who added greatly to the prestige of acoustics
at Bell Labs were Peter Denes, author (with Elliot Pinson) of The Speech
Chain, Bob Wallace (inventor of the tetrode transistor and several ingenious
directional microphones for conference telephony) and Warren Mason (holder
of over 200 U.S. patents). Warren in turn was aided by Orson Anderson (who
invented compression bonding), and Bob Thurston and Herb McSkimmin,
wizards of ultrasonic precision measurements.

A physicist by training, I also enjoyed friendly relations with a few of the
physicists at Bell, notably Sid Millman (long in charge of AT&T’s University
Relations and co-editor of the Bell System History), Conyers Herring, with
whom I shared an interest in Russian, Phil Anderson, who was our guest when
the Göttingen Academy awarded him the Heineman Preis, Stan Geschwind,
the laser pioneer Jim Gordon, whom my wife Anny knew from Columbia,
Al Clogston, inventor of end-fire arrays, Rudi Kompfner, inventor of the
traveling wave tube, Arno Penzias and R.W. Wilson, discoverers of the Big Bang
radiation, Bill Pfann of zone refining fame, Walter Brattain, inventor of the
point-contact transistor, whom I didn’t get to know until an Institut de la Vie
meeting in Paris in 1976, and Phil Platzman and Horst Störmer, co-discoverer
of the fractional Hall effect, who, for many years, kept me supplied with my
favorite engagements calendar that includes the consumer price index for the
last 50 years and a chart of vintage wines.

I also kept in touch with several psychologists, notably Roger Shepard, 
whose work on scaling I greatly admired, Saul Sternberg, Ernie and Carol 
Rothkopf, Joe Kruskal and Doug Carroll, whose multidimensional preference 
algorithms came in very handy when I became interested in the subjective 
quality of sounds and concert halls. 

Attorneys Harry Hart and Al Hirsch steered my patents from rough design
to legal perfection, often adding their own innovations. 

Atal et al. 

My sojourn at Bell was blessed with an abundance of new talent, Bishnu 
Atal foremost among them. I had heard about Bishnu from my old ally in 
room acoustics, Dick Bolt of Bolt, Beranek, and Newman in Cambridge, 
Massachusetts. I don’t recall why BBN would or could not hire Bishnu but, 
even without a Ph.D., he sounded too good to miss. I therefore wrote Bishnu 
and called him at the Indian Institute of Science in Bangalore. This was 
in 1961, before satellite communication, and while everything was crisp and 
clear up to New Delhi, communication completely fell apart beyond Poona, 
India. I had started our “conversation” with the rhetorical question whether 
he had already received my letter. Sorry to say, after forty-five minutes of 
misunderstandings and repetitions (some with the help of telephone operators 
in White Plains (New York), London, New Delhi, and Poona) that question
was still unanswered. So I finally gave up and said “I enjoyed talking to you. 
I will confirm all this by letter.” (It is only fair to add that the telephone 
company - fully aware of the non-communication - charged us only for three 
minutes connection time.) 

In due course Bishnu arrived at Murray Hill, and the stage was set for 
a close and enduring collaboration, first in hearing and room acoustics and 
before too long in speech. Simultaneously, Bishnu worked on his Ph.D. from 
Brooklyn Polytechnical Institute (with a thesis on speaker recognition). 

The 1960s also saw the advent of such outstanding researchers as Mike Noll,
with whom I explored visual perception of four-dimensional spaces and who
later advanced to an office in the White House; Gerhard Sessler from
Germany; Hiroya Fujisaki, Sadaoki Furui, and Osamu Fujimura from Tokyo;
Mohan Sondhi, who came via Canada; Cecil Coker, Paul Mermelstein, Harry
Levitt, Aaron Rosenberg, Noriko Umeda, Jont Allen, Dave Berkeley, Allen
Gersho, Gary Elko, Jim West, Roger Golden, Oded Ghitza, Naftali Tishby,
and Joseph Hall, with whom, in the 1970s, after a long slumber, I revived
hearing research at Bell.

I also enjoyed many instructive encounters with yet another generation 
of linguists at Bell: Mark Liberman, Ken Church, and Ken Silverman.^ Dan 
Kahn and Marian Macchi, with whom I once worked a bit on poles, zeros and 
nasals, later developed the speech synthesis program Orator™ at Bellcore.

Bell Laboratories has benefitted greatly from the cooperative program in 
electrical engineering with MIT. Under the auspices of this program, I met 
some of my most able students, notably Dick Hause, who explored single- 
sideband modulation and infinite clipping, and Tom Crystal, with whom I 
worked on time-domain vocoders. 

The prolific Larry Rabiner worked with us on digital circuits and bin- 
aural hearing from 1962 to 1964. In 1967 he joined Bell Labs as a full-time 
researcher and before too long rose on the corporate ladder to Department 
Head, Director and Vice President before he left for (stayed with?) AT&T. 

I cannot complete this account without mentioning the expert programming
help I received from Carol McClellan and Lorinda Landgraf Cherry, who
programmed the poster for Marion Javits’s and Billy Klüver’s^ avant-garde
art show Experiments in Art and Technology at the Brooklyn Museum, see
Fig. 1, and Sue Hanauer, who realized the image One Picture is Worth a
Thousand Words (Fig. 2).

Fig. 1. The author with Lorinda Cherry and Marion and Senator Javits at the
opening of the exhibition Some New Beginning - Experiments in Art and
Technology at the Brooklyn Museum in November 1968. The wall-size poster
in the background is a computer graphic by the author and was programmed by
Ms. Cherry. It combines an image of the museum with the program of the exhibit.

Fig. 2. One Picture is Worth a Thousand Words. A computer graphic by the
author programmed by Suzanne Hanauer. This image won First Prize at the
International Computer Art Exhibition in Las Vegas in 1969.

Fig. 3. Prime Spectrum. Another computer graphic by the author with Sue
Hanauer. The image, originally containing 1024x1024 points, shows the Fourier
transform of the distribution of the relative primes. It was calculated by a fast
Fourier transform and (in 1969) required shutting down the Bell Laboratories
computer center to marshal the necessary core memory.

^Silverman, an Australian linguist, once got tripped up on the German word for
emergency, Not, when he found himself trapped inside a burning building in Austria.
Every door he approached repulsed him with a forbidding verboten sign saying
NOTAUSGANG - not exit? My increasingly frantic friend, desperately seeking
Ausgangs, knew enough Latin and German (besides his native English) to properly
decode aus-gang as ex-it. But in the heat of the emergency, he never succeeded in
severing the Gordian knot: Not is not not. Eventually Silverman tried a “not exit”
and escaped!

^The Swede Klüver is now “immortalized” as the hapless depression-era
“American” farmer in the new Franklin Delano Roosevelt Memorial in Washington.
1. Introduction



Language was created by Nature - like Vision or Digestion. 

Epicurus (341-270 BC) 

Language ability is a gift of God - except for Chinese, 
which is the invention of a wise man. 

Gottfried Wilhelm von Leibniz (1646-1716) 



Where does language come from? Did God create it, or is it a result of natural 
selection and fitful adaptation, as Epicurus held? Does language languish, 
once conceived, or does it grow and decay like living “matter”? And what 
did Leibniz have in mind when calling Chinese an invention of a wise man? 
Was he alluding to the fact that Chinese is a tonal language and as such
perhaps easier to recognize? Whatever the answers, language lets us live the
lives we love and like to talk about.

In this introductory chapter we shall explore some of the possibilities of 
language that were neither created by God nor foreseen by Nature: linguistic 
signals fashioned not by human tongues but manufactured by machines. And 
we shall likewise ponder speech signals perceived by machines rather than 
the human ear. Let us take a look at some “ancient” history of our subject. 



1.1 Speech: Natural and Artificial 

Language - both written and spoken - is the primary means of communica- 
tion between humans. Written language and the richness of spoken idioms 
distinguish the human species from all other forms of life: language is the 
very essence of humanness and humanity. 

Yet, human communication by language is beset by many obstacles, from 
the simple misunderstanding within a single dialect to the “untranslatable” 
slang between different languages and cultures. Illiteracy, much more
common than naively assumed, is one of the great banes of civilization. For the
hard-of-hearing, the deaf, the speechless, and the deaf-mute, spoken language 
communication is impaired or impossible. But even for the sound listener, the 
screaming of an overhead plane, the noisy restaurant, the roar from a nearby 
highway, lawn mower, or leaf blower - or the chatter of a cocktail party in 
full swing - can make normal speech perception difficult if not hopeless. 

The large distances covered by modern trade and travel demand new 
means of rapid and reliable communication. The traveler - on the road, in 
the air, on a train - calls for new methods of mobile communication that offer 
maximum convenience and privacy without cluttering up scarce “air” space. 
While optical fibers offer ever more communication capacity, the usable radio 
frequency spectrum for wireless communication by mobile phones is limited. 

Many of the problems of spoken language communication - both ancient 
and those engendered by modern mores and technology - are amenable to 
amelioration by emerging strategies of speech processing: the transformations 
of speech signals for more efficient storage and transmission, for enhanced 
intelligibility and ease of assimilation. These new departures in spoken lan- 
guage processing should be based on a thorough understanding of how hu- 
mans speak and how they hear the sounds that impinge on their ears - and 
how they absorb the intended message. 

On the technical side, it is the emergence of fast algorithms, such as 
the Fast Fourier Transform, and the tiny transistor and its latter-day descen- 
dants, the integrated circuit and computer chip, that have made sophisticated 
signal processing a present reality. Modern computers and digital signal pro- 
cessors can do good things to speech that would have taken many bays of 
analog equipment not too long ago. And the future holds further advances at 
the ready: digital hearing aids that not only amplify speech but suppress un- 
wanted sounds; voice synthesizers that actually sound human; public-address 
systems that work; better book-reading aids for the blind; error-free automatic
speech transcription (the voice-typewriter or “electronic” secretary);
and reliable speaker verification for access to confidential data and limited 
resources (such as one’s own bank account). 

Many of these modern miracles can trace their ancestry to an invention 
made in 1928. 



1.2 Voice Coders 

In 1928 a new transatlantic telegraph cable was installed between Britain and 
the United States. Its frequency bandwidth was an unheard-of 100 Hertz, an 
order of magnitude wider than previous cables. Still its bandwidth was 30 
times narrower than that required for the transmission of intelligible telephone
speech. Yet this did not deter an engineer at Bell Telephone Laboratories
in New York named Homer Dudley from thinking about sending speech
under the Atlantic. Dudley reasoned that speech is produced by the sluggish
motions of the tongue and the lips and that meager information should easily 
fit within the capacity of the new cable. 

Unfortunately Dudley’s idea foundered as soon as it was born because 
neither he nor anyone else knew how to extract tongue and lip positions from a 
running speech signal. These “articulator” positions change the resonances of 
the human vocal tract and these resonances, in turn, determine the frequency 
content, the sound, of the speech signal. After all, that is how humans hear 
and distinguish different speech sounds, like the “ah” in the word bar, or “ee” 
in beer. But the frequency content, called the spectrum - as opposed to the
articulator positions - can be easily measured. Dudley therefore suggested
transmitting this spectrum information across the Atlantic, rather than the 
speech signal itself, and reconstructing a replica of the original speech at the 
other end. The spectral description should easily fit into the new cable. The 
resulting invention, first described in his Lab notebook in October 1928, was 
called Vocoder (coined from Voice coder) by Dudley [1.1]. 

The Vocoder is the grandfather of modern speech and audio compression 
without which spoken language transmission on the Internet (think of In- 
ternet telephony and live broadcasts over the World Wide Web) would be 
impossible. 

Of course it was not easy to build the first Vocoder and its manually 
operated cousin, the Voder. With their profusion of resistors, capacitors, in- 
ductances, and vacuum tubes (there were no transistors then, let alone inte- 
grated circuits) these early speech coders were truly monstrous. Nevertheless, 
the monsters were eventually built and demonstrated to great public acclaim 
at the New York World’s Fair in 1939. An electric speaking machine!

The Vocoder was first drafted into service during World War II in the 
secret telephone link between British Prime Minister Winston Churchill in 
London and President Roosevelt in the White House in Washington and 
Allied military headquarters on five continents. The vocoder allowed enough 
compression of the speech signal so that it could be digitized by as few as 
1551 bits per second and subsequently encrypted by what professional spooks 
call a “one-time-pad,” the most secure method of encryption.¹
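
The principle of the one-time pad can be illustrated by a minimal sketch. It assumes that the key is combined with the digitized speech by a bytewise exclusive-or, and it lets Python's secrets module stand in for a physical noise source; neither detail is taken from the wartime system, and the function name one_time_pad is hypothetical.

```python
import secrets

def one_time_pad(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte.

    Applying the same key a second time restores the original data.
    """
    if len(key) < len(data):
        raise ValueError("a one-time-pad key must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))

# One second of vocoder output at 1551 bits per second is about 194 bytes.
frame = secrets.token_bytes(194)        # stand-in for digitized vocoder parameters
key = secrets.token_bytes(len(frame))   # random key, to be used once and then discarded
cipher = one_time_pad(frame, key)
recovered = one_time_pad(cipher, key)
print(recovered == frame)
```

The security rests entirely on the key being random, as long as the message, and never reused, which is why both ends of the link needed identical, synchronized copies of the key material.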

This was the actual beginning of the digital speech age, but it was kept 
under wraps until thirty years after the war for reasons of military secrecy. 
For more details of this epic story, the reader is referred to the eyewitness 
account by Ralph L. Miller, one of the few surviving participants [1.2]. 

The vocoder has spawned many other methods of speech compression, not 
least Linear Predictive Coding (LPC), which has enriched not only speech 
compression but speech recognition and voice synthesis from written text. 

¹The secret key was obtained from thermal (Johnson) noise - not from some
algorithmic noise generator, which might have been broken. The random bits ex- 
tracted from the thermal noise were recorded on phonograph disks that were hand- 
carried between the continents. Synchronizing these disks was one of the many 
technical problems that had to be surmounted.

Compression is crucial for “real audio” on the World Wide Web and voice
security for the Internet. Speech recognition and synthesis are ever more
ubiquitous in all kinds of verbal information services. Information that one
can listen to instead of having to read it by eye is a mode
of communicating that is a lot less dangerous for the driver on the road or 
the surgeon at the operating table - to mention just two of the innumerable 
applications. 

Since the vocoder separates the excitation signal (the glottal puffs of air) 
from the transfer function (the vocal tract) all kinds of games can be and 
have been played, like “head switching” (see Sect. 1.10) and changing a male 
voice into a female (Sect. 1.11). Some pranksters have even used the noise
from a sputtering and stalling Volkswagen engine as an excitation signal and
had the VW whimper “I am out of gas ... I am dying.” Cute effects can
be achieved by using organ music as an excitation signal and having the 
organ preach and admonish the congregation in no uncertain tones - a kind 
of virtuous reality. 



1.3 Voiceprints for Combat and for Fighting Crime 

The linguistically important elements in a speech signal can be represented 
visually to the human eye. Such a graphical rendition of speech is called a 
speech spectrogram or, somewhat misleadingly, “voice print.” Sound spectro- 
grams portray the energy in an audio signal, shown as different shades of 
gray or color, as a function of frequency and time, see Fig. 1.1. The temporal 
resolution of the underlying Fourier analysis is about 10 ms - like that of hu- 
mans listening to speech. The corresponding frequency resolution is roughly 
100 Hz. For better spectral resolution, the temporal window can be increased 
in duration. Phase information is discarded in the spectrogram. Thus, speech 
spectrograms show the linguistically important information of an utterance: 
what was said and, to an extent, who said it. 
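
The trade-off between temporal and spectral resolution can be made explicit with a rule of thumb that is assumed here rather than derived: the frequency resolution of a Fourier analysis is roughly the reciprocal of the duration of its time window. The sketch below (the function name is hypothetical) reproduces the 10 ms and 100 Hz pairing quoted above and shows how a longer window sharpens the spectral resolution.

```python
def frequency_resolution_hz(window_s: float) -> float:
    """Approximate spectral resolution of a Fourier window of the given duration.

    Uses the rule of thumb (resolution ~ 1 / window duration); the exact factor
    depends on the window shape, which is ignored here.
    """
    return 1.0 / window_s

for window_ms in (10, 20, 40):
    resolution = frequency_resolution_hz(window_ms / 1000.0)
    print(f"{window_ms} ms window -> about {resolution:.0f} Hz resolution")
```

Doubling the window halves the smallest frequency difference that can be resolved, at the price of smearing events that are closer together in time than the window is long.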

In addition to the gray or color scale, contours of equal energy can be 
superimposed on the spectrograms. This makes them look vaguely like a fingerprint
(hence the term voiceprint), but their practical usefulness is thereby
hardly enhanced. Figure 1.2 shows an image composed entirely of contour 
lines. Adorning the cover of a book on speech recognition, this picture was 
once mistaken, by a prominent speech scientist, for a voice print [1.3].

In principle, trained people should be able to “read” sound spectrograms. 
But in practice this has proved difficult, especially on a real-time basis. This 
is unfortunate because a running sound spectrogram, displayed on a video 
screen, would have given the deaf a measure of speech communication beyond 
lip reading and tactile sensors. This effort, alas, has failed, but not for want 
of exertion. Still, as a tool in speech research, the sound spectrograph - 
originally an analog device, but now often implemented on computers - has 
proved of great value.

Fig. 1.1. Speech spectrogram or “voice print” of the utterance “computer speech?”
by a male speaker. Time, in seconds, runs from left to right. Frequency, in Hertz, 
runs vertically. Increasing spectral intensity is indicated by increasing degrees of 
blackness. The dark areas are the formants or resonances of the vocal tract. The 
striations, visible especially at the end of the utterance (around 2.5 s), are the fun- 
damental frequency and its harmonics. They are seen to be rising as appropriate 
for a question (“computer speech?”). Such spectrograms contain much of the lin- 
guistic information in a speech signal but are difficult to “read” and controversial 
for identifying voices from a large group of speakers. The designation “voice print” 
is therefore misleading 



In addition to identifying sounds, speech spectrograms can be used to 
identify individual speakers from a limited pool of potential voices. In fact, 
sound spectrographs were originally developed in the United States during 
World War II to analyze radio voice communication by the enemy. One aim 
was to identify individual radio operators at the division level and to infer 
from their movements on the other side of the front line impending offensive 
actions by the opposing forces. The famous Battle of the Bulge was thus 
signalled long before its starting date (16 December, 1944).²

In forensic applications, voiceprints are especially useful in eliminating 
a suspect because, with a given vocal apparatus, some suspects could not 
possibly have produced the recorded utterance. But the general applicability 
of voiceprints in criminal trials remains doubtful [1.4]. 



²Another piece of evidence was the sudden change of secret radio codes preceding
this last-gasp German offensive. The new code, however, was almost instantly
cracked by the Allies because one German commander couldn’t decipher the new
code and asked for a retransmission in the old - already broken - code. The
retransmission opened the door to what is known in the deciphering community
as a “known clear-text” attack.

Fig. 1.2. Sometimes speech spectrograms are adorned by contour lines of equal
spectral intensity to make them look more like fingerprints. This figure was once
misread by a leading speech scientist as such a contour spectrogram. However, the
contours in this computer graphic represent lines of equal optical intensity in the
likeness of a bearded young man (Wolfgang Moller, who programmed this image).
To properly see the subject it helps to view it from a large distance. This
computer-generated image, called “Eikonal Portrait,” is an example of the efforts by
the author and his colleagues at Bell Laboratories, especially the late Leon Harmon,
to create images that show different things at different viewing distances



By contrast, voiceprints have been spectacularly successful in supplying 
clues for sorting out the causes of several man-made disasters. One fateful 
case was the collision of two airliners over the Grand Canyon in 1956. The 
last words recorded from one of the planes, uttered in a screaming voice just 
before the crash, were “We’re going in...”. Apparently, the screamer had
just noticed the other plane closing in. But which member of the crew was 
it? Although the utterance was utterly unnatural in pitch, analysis of the 
spectrogram revealed that the speaker had an unusually long vocal tract, 
as betrayed by his resonance frequencies being noticeably lower than those 
produced by a normal vocal tract of 170-mm length. Thus, the speaker was 
probably an exceptionally “big” person. And there was, in fact, such a person 
on that plane - the copilot. From his position in the cockpit, investigators 
could infer the direction from which the other plane was probably approach- 
ing - a crucial piece of evidence in the reconstruction of the disaster. 
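
Why a longer vocal tract lowers the resonance frequencies can be seen from a deliberately crude model. The sketch assumes a uniform tube, closed at the glottis and open at the lips, and a sound speed of 340 m/s; this is a textbook idealization and not the method used by the investigators. Such a tube resonates at odd multiples of c/(4L), which for L = 170 mm gives roughly 500, 1500 and 2500 Hz.

```python
SPEED_OF_SOUND_M_S = 340.0  # assumed round value for the speed of sound in air

def tube_resonances_hz(length_m: float, count: int = 3) -> list:
    """Resonances of a uniform tube closed at one end and open at the other.

    The k-th resonance lies at (2k - 1) * c / (4 * L).
    """
    return [(2 * k - 1) * SPEED_OF_SOUND_M_S / (4.0 * length_m)
            for k in range(1, count + 1)]

for length_mm in (170, 190):  # 190 mm is a hypothetical "long" vocal tract
    resonances = tube_resonances_hz(length_mm / 1000.0)
    print(length_mm, "mm:", [round(f) for f in resonances], "Hz")
```

In this model all resonances scale inversely with the tube length, so a tract that is about 10% longer shifts every formant down by about 10%, independently of the pitch at which the speaker happens to scream.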

Another tragic case in which the voiceprint of a last message was instru- 
mental in pinpointing the cause of the calamity was the fire in an Apollo 
space capsule that cost the lives of three American astronauts during a train- 
ing run on the ground. Here the last recorded message, again screamed in a 
terrified voice, began with the words “Fire! We are burning up...”. Which of
the three astronauts was the speaker? The question mattered because it was he
who had presumably first seen the fire. Because of the highly unnatural, screaming quality of
the signal (the pitch was above 400 Hz), the voice could not be identified by 
human listeners familiar with the voices of the astronauts. But comparisons 
of the voiceprint with voiceprints of the normal voices revealed the identity 
of the screamer. He was sitting not in the middle between his two mates 
but off to one side - the side where, in all likelihood, the fire had started. 
This pinpointing of the location of the fire’s origin was an important clue in 
NASA’s investigation and a subsequent improved design of the space capsule 
(including, incidentally, the elimination of pure oxygen for breathing). 

Beyond portraying human voices, sound spectrographs have been successfully
employed in the analysis of a wide variety of acoustic signals, including
animal sounds. In such applications it is important to “tailor” frequency range 
and resolution of the spectrograph to the information-bearing characteristics 
of the signal. Thus, to analyze, near New York, the infrasound (fractions of
a Hertz) from the launching of a space vehicle in Florida, the rocket rumble
should be recorded (on an FM tape recorder) and translated up in frequency
by speeding up the playback of the tape so that the signal falls into the audio
range. In other applications, frequency shifting may be preferable.
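
The difference between the two ways of moving a signal up in frequency can be shown with a few numbers. The component frequencies and the factors in this sketch are hypothetical: speeding up the playback multiplies every frequency by the same factor, whereas frequency shifting adds the same offset to every component.

```python
def speed_up(frequencies_hz, factor):
    """Playing a tape `factor` times faster multiplies every frequency by `factor`."""
    return [f * factor for f in frequencies_hz]

def shift(frequencies_hz, offset_hz):
    """Frequency shifting adds the same offset to every component."""
    return [f + offset_hz for f in frequencies_hz]

rumble_hz = [0.5, 1.0, 1.5]        # hypothetical infrasonic components
print(speed_up(rumble_hz, 64))     # ratios 1:2:3 are kept; the duration shrinks 64-fold
print(shift(rumble_hz, 100.0))     # the 0.5-Hz spacing is kept; the ratios are not
```

Speeding up thus preserves harmonic relationships but compresses time, while shifting preserves duration and spacing but destroys harmonic ratios; which is preferable depends on what carries the information.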

The importance of selecting the proper time and frequency resolution is 
illustrated by the following near snafu. One of my doctoral students at the 
Max Planck Institute in Göttingen, working with guinea fowl (Numida meleagris),
was unable to elicit the expected motion responses of the birds (head
turning etc.) with synthetic sound stimuli which were closely patterned on 
their natural utterances. The synthetic calls sounded exactly like the natural 
ones - and looked the same on a spectrogram. But the birds wouldn’t budge. 

I suspected that, perhaps, the time resolution of the spectrograph (designed
for human speech) was too low. Indeed, spectrograms of the birds’
natural calls at greatly reduced tape speeds revealed temporal detail in the
100-µs range that had been completely obscured by the 10-ms time resolution
of the spectrograph (and was inaudible to human listeners). Resynthesizing
the calls with due attention to fine temporal detail made the birds respond 
just as they did to their own natural calls. In fact, it turned out that consid- 
erable liberty could be taken with the frequency content of the calls as long 
as the temporal structure was preserved. Thus, it seems that guinea fowl,
and perhaps many other species, have a much greater time resolution - and
use it - than mammalian monaural systems. (The human binaural
system also resolves time differences in the 10 to 100-µs range. Otherwise the
localization error in the horizontal plane would be uncomfortably large for
survival.)



1.4 The Electronic Secretary 

Automatic speech recognition has been the dream of many a manager. Think 
of a machine that automatically executes every spoken command or transcribes
your dictated letters without typographical errors, misunderstandings
or other distractions. And what a boon speech understanding by machines 
would be for all kinds of information services. But except for rather special- 
ized tasks with a limited vocabulary, like airline and train travel information, 
progress has been slow. In fact, some recorded dialogues between humans 
and machines have been nothing less than hilarious for their unintended hu- 
mor brought about by total mutual misunderstandings between machine and 
human - such as the machine, not being able to understand a certain word, 
responding over and over again “please repeat”; and the human, instead of 
simply repeating the word, saying something like “OK, I will repeat...” which
threw the uncomprehending machine completely off course.³

It is clear that for automatic speech understanding to become more gen- 
erally useful, machines have to assimilate a lot more syntax, i.e. how spoken 
sentences are constructed. (Perhaps a sin tax would help, too.) They also have 
to absorb the idiosyncrasies of different individual speakers, often requiring
long training sessions. 

Even more difficult, machines have to learn a lot more about the subjects 
that people talk about. Much like human listeners, to properly understand 
speech, machines have to know the different meanings of words and the subtle 
semantics that underlies most human discourse. But this is something very 
difficult to teach to dumb and mum machines. Humans, from day one of their 
lives (maybe even before birth) learn a lot of things acoustic by “osmosis,” so
to speak, that are difficult to formalize and put into the kind of algorithms
that machines prefer as their food for “thought.”



1.5 The Human Voice as a Key 

A task closely related to speech understanding is speaker verification. Applications
include controlling access to restricted resources, off-limits areas,
or confidential files (such as medical or criminal records). Banking by phone,
using one’s voice for identification instead of the signature on a check, is
another potential application.⁴

Voice recognition is also important in military chains of spoken, non-lethal 
commands. (Was it really my boss, General Halftrack, or some subordinate
sergeant who called me with an urgent order for new golf clubs for the Strategic
Weapons Reserve?) Automatic speaker verification for such applications,



³This account reminds me of the fable of the first-time PC user who returned his
machine to the manufacturer because it lacked an ANY key and he was therefore 
unable to execute the installation instruction “hit any key.” 

⁴Representatives of the American Bankers Association once told the author that
absolute certainty of identification by voice was not necessary and that even with
signed checks many errors occurred, including cashing checks with false signatures
or no signatures at all, costing American banks large amounts of money.

in which false alarms and misses have no catastrophic consequences, holds
considerable promise.



1.6 Clipped Speech 

Nonlinear distortions of speech signals, such as peak clipping in overloaded 
amplifiers, lower the quality of speech signals without, however, affecting
the intelligibility much. In fact, as J.C.R. Licklider (one of the progenitors of 
the Internet) has taught us, “infinite peak clipping,” i.e. reducing the speech 
signal to just two amplitude levels (see Fig. 1.3), results in a square-wave-like
signal that sounds horribly distorted but, to everyone’s surprise, is still quite 
intelligible [1.5]. How is this possible? 



Fig. 1.3. Top: the waveform of an undistorted speech signal. Bottom: the
corresponding infinitely clipped signal. Although sounding highly distorted,
infinitely clipped speech signals remain surprisingly intelligible



The seemingly obvious answer was that most of the information content 
of a speech signal must reside in its zero-crossings because the zero-crossings
are the only parts of the signal that are preserved by infinite clipping.
(Mathematically, infinite clipping corresponds to taking the algebraic sign of the
signal, with zero defined as the output for a zero input.)
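
Infinite clipping as just defined takes only a line or two of code. The following minimal sketch uses a hypothetical two-component test signal in place of real speech and checks that the clipped signal keeps every zero-crossing of the original while discarding all amplitude information.

```python
import math

def infinite_clip(samples):
    """Return the algebraic sign of each sample: +1, -1, or 0 for a zero input."""
    return [(x > 0) - (x < 0) for x in samples]

def zero_crossings(samples):
    """Count sign changes between successive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)

SAMPLE_RATE = 8000  # samples per second (assumed)
speech_like = [math.sin(2 * math.pi * 120 * n / SAMPLE_RATE)
               + 0.5 * math.sin(2 * math.pi * 700 * n / SAMPLE_RATE + 1.0)
               for n in range(800)]
clipped = infinite_clip(speech_like)

print(sorted(set(clipped)))                                   # only the levels that survive
print(zero_crossings(speech_like) == zero_crossings(clipped)) # crossings are unchanged
```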

However, this theory is contradicted by the fact that many operations 
that displace the zeros have little or no effect on intelligibility. For example,
manipulating the phase of the signal by putting it through an allpass filter
with limited group delay distortion makes little perceptual difference. Differ- 
entiating the signal before infinite clipping even enhances its intelligibility. 
Frequency shifting the signal to higher frequencies, above 20 kHz say, before 
clipping and then frequency-shifting down again results in a further increase 
of intelligibility and even improves the quality. 

What is going on? As is well known, human speech perception is based
on the time-varying spectrum of the signal. Nonlinear distortion generates
additional frequency components (“spectral clutter”) but does not destroy the
spectral prominences (formants) and their movements that our ears rely on
in decoding the signal. That is the reason for the relatively high intelligibility
of infinitely clipped speech, not the preservation of the zero-crossings.⁵

Allpass filtering can even eliminate perceptual distortion completely! This 
was demonstrated by the author by putting speech signals through allpass 
reverberators and applying (approximate) inverse filtering after infinite clip- 
ping [1.6]. If the reverberation process has an impulse response that is time- 
inverted compared to ordinary reverberation (i.e. echo intensity increases 
with increasing delay) speech becomes not only unintelligible but, at rever- 
beration times exceeding 2 s, does not even sound like speech anymore. 

Such “negative” reverberation is observed in the deep ocean for sound 
transmission over long distances (several thousand kilometers). The strongest
sound, traveling in a straight line at constant depth, is the slowest ray and ar- 
rives after all the weaker rays. Evolution has enabled us to “live with” normal 
reverberation, such as occurs in caves and dense forests, but our ancestors 
never encountered time-inverted reverberation in the course of evolution and 
nature apparently didn’t bother to tell us how to suppress it. 

Infinite clipping of this curious-sounding signal makes it sound even more
curious. But, to the great surprise of some life-long speech experts, a com- 
pletely undistorted signal was recovered after inverse filtering. Apparently, 
the inverse allpass filtering converts the signal-correlated distortion (per- 
ceived as nonlinear distortion) into an uncorrelated noise (perceived as an 
additive background noise).⁶



⁵Differentiating and frequency-shifting increase the number of zero-crossings,
resulting in less in-band spectral clutter. Representing a speech signal s(t) as the
product of its Hilbert envelope a(t) and a phase factor cos φ(t), infinite clipping
after frequency shifting is equivalent to dividing the signal s(t) by its envelope a(t).
This suggests that phase filtering to reduce the variability of a(t) (the signal’s “peak
factor”) will result in less audible distortion. This is indeed so: signals with a nearly
constant a(t) suffer less from clipping.

⁶To further analyze this astounding effect, one may decompose the input-output
step function inherent in infinite clipping into a linear part and a remainder with 
no linear component. Inverse filtering the linear part gives the undistorted signal 
whereas the remainder is converted into an added Gaussian-like noise. 

If this explanation is correct, then distorting the signal by a symmetric input-output
function which has no linear component, such as full-wave rectifying or



1.7 Frequency Division

The misleading observation that the information of a speech signal resides in 
its zero-crossings has spawned some interesting ideas for speech compression. 
These schemes fall under the generic label frequency division [1.7]. In a typical 
frequency-division bandwidth compressor, the signal is filtered into several 
adjacent frequency bands, such as formant frequency regions for speech. In 
each channel the “carrier” frequency is halved (e.g. by eliminating every
other zero-crossing and interpolating smoothly). The resulting signal is centered
on half the original carrier frequency but does not in general have half the 
original bandwidth. (In fact, for an AM broadcast signal, halving the carrier 
frequency will not reduce its bandwidth. It will simply move to a different 
location on the radio dial, still emitting the same program.) Nevertheless, the 
frequency-halved signal is sent through a bandpass filter with half the original 
bandwidth. Thus, a total bandwidth compression by a factor 2 is achieved. 
At the synthesis end all received frequencies are doubled, for example by 
full-wave rectification (taking the absolute value) and smoothing. This is, 
roughly, how the so-called Vobanc works [1.8]. The resulting speech signal is 
intelligible but somewhat marred by “burbling” sounds. 
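
The frequency doubling at the synthesis end can be tried out on a single tone. This is a minimal sketch under stated assumptions, not a model of the Vobanc: a 100-Hz sinusoid sampled at 8000 Hz stands in for one channel, the mean of the rectified signal is subtracted in place of a proper smoothing filter, and the frequency is estimated simply by counting zero-crossings. The function names are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

def crossing_frequency_hz(samples):
    """Estimate the frequency from the number of sign changes (two per cycle)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return crossings * SAMPLE_RATE / (2.0 * len(samples))

def full_wave_double(samples):
    """Full-wave rectify (absolute value), then remove the mean.

    The result oscillates about zero at twice the input frequency.
    """
    rectified = [abs(x) for x in samples]
    mean = sum(rectified) / len(rectified)
    return [x - mean for x in rectified]

# One second of a 100-Hz tone with a small phase offset
tone = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.1)
        for n in range(SAMPLE_RATE)]
doubled = full_wave_double(tone)

print(crossing_frequency_hz(tone), crossing_frequency_hz(doubled))
```

The rectified tone also contains higher even harmonics of the input, which is one reason why smoothing is needed in practice and why the reconstructed speech is not free of distortion.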

If each harmonic has its own filter channel (with a bandwidth of
100 Hz or less), the distortion of a frequency compressor can be made practically
inaudible. This idea, dubbed Harmonic Compressor (see Fig. 1.4), was
first tested by B. F. Logan and the author by digital simulation in 1962 using
a block-diagram compiler [1.9]. The demonstration was in fact the first large-scale
application of digital speech processing, comprising the simulation of
160 (Hamming-type) bandpass filters: 40 filters for separating the signal into 
adjacent frequency bands, 40 filters to remove the distortion from halving 
the bandwidth at the transmitter, 40 more filters for channel separation at 
the receiver and a final set of 40 filters for reducing the distortion inherent 
in frequency doubling. In spite of a very efficient filter design, the running 
time on an IBM 7090 computer was 530 times real time. But that was a 
small price to pay for a successful test that would have been prohibitive with 
analog circuits. 

The output of the analyzer, when speeded up by a factor 2 (by means of 
a two-speed tape recorder, for example), constitutes time-compressed speech 
with little audible distortion. The design of such a time compressor was made 
available by Bell Laboratories to the American Foundation for the Blind who 

squaring the signal, should give complete “rubbish” in combination with allpass 
reverberation and inverse filtering - which it does: the result is unintelligible noise 
that is not even speech-like. 

Implicit in these observations is a method to reduce the audible effects of non- 
linear distortion: Randomize the phase angles of the signal by a linear filter that 
has an inverse before applying it to the nonlinear device. Inverse filter the distorted 
signal. If high-frequency emphasis is used before distortion, the noise power can be 
reduced by lowpass filtering.

Fig. 1.4. Top: the spectrum of the fundamental frequency f₀ and its harmonics of
a vowel sound. Bottom: the “compressed harmonics” after the frequency gaps between
the harmonics have been eliminated. The resulting signal can be transmitted
over half the original bandwidth. By recording the frequency-compressed signal and
playing it back at twice the speed, speech can be accelerated by a factor 2. Such
a speech accelerator has been made available for the book-reading program of the
American Foundation for the Blind. By the inverse process (“harmonic expansion”)
speech can be slowed down, with interesting applications in foreign-language
acquisition and studies of aphasic patients



had the compressor built for use in their recorded-book program for the blind. 
With this technique, blind people could “read” books at nearly the rate of 
sighted people. 



1.8 The First Circle of Hell: Speech in the Soviet Union 

The success of frequency division by a factor 2 soon led to schemes attempt- 
ing to compress the bandwidth by a factor 4 or more. Marcou and Daguet in 
France claimed a compression factor of 8 [1.7]. But careful analysis revealed 
that their signal was intelligible only because their filters were not sharp 
enough. Bona fide bandlimiting (with sufficiently steep roll-off bandpass fil- 
ters) destroyed the usefulness of the signal. 

Alexander Solzhenitsyn must have known all this when he wrote his First 
Circle of Hell. The action takes place in a sharaga, a Soviet speech research
laboratory peopled by political prisoners (zeks). One of the lab’s aims, dic- 
tated by Stalin personally, was to develop reliable voiceprinting so his security 
forces (NKVD, later KGB) could better identify dissidents and other critics 
of the regime from monitored phone calls. Another aim of the laboratory was
speech compression so that speech signals could be digitally encrypted for 
secure transmission over existing telephone lines in the Soviet Empire. 

Apart from channel-vocoder techniques, frequency division schemes were
vigorously pursued at the Soviet laboratory. After “successful” demonstration
of 4:1 frequency compression, prisoner Rubin moves on to a compression
factor of 8. To test its viability, he summons fellow workers walking by his
lab and reads to them, over his compressor, the first paragraph of the latest 
lead story from the Communist Party newspaper Pravda. Then he asks the 
listeners whether they were able to understand the text. The answer invari- 
ably is da. So Rubin decides to move on to a frequency compression ratio of 
16:1. 

To understand the supreme irony of Solzhenitsyn’s description, one has to 
know that useful frequency compression by a factor 8 is completely fictitious 
and to try 16:1 is outrageous. Of course, Solzhenitsyn knows that; he was a 
zek himself, working in just such a laboratory (as punishment for daring to 
criticize comrade Stalin’s leadership during World War II). 

But the irony goes deeper. As anyone who has ever glanced at a lead 
article in Pravda knows, the information content in the first 10 lines or so is 
practically nil, most articles beginning with the exact same words: “the Pre- 
sidium of the Central Committee of the Communist Party of the Union of 
Soviet Socialist Republics [i.e. the Politburo] and the Council of Ministers of 
the USSR...”. Of course, Rubin is not consciously engaged in deceiving his
superiors; he is simply a victim of self-deception, a common cause of speech
researchers being led astray - and of course not just speech researchers. Rubin
has no doubts that his compressor produces intelligible speech - much as
I once “proved” the usefulness of a correlation vocoder by counting (from 1 to 
10) over my new vocoder and asking innocent listeners whether they under- 
stood (answer: yes!) [1.10]. Of course, counting can be decoded by the human 
ear just from the rhythm of the utterance. (More formal tests conducted later 
revealed a surprisingly low intelligibility of the device - surprising perhaps 
only to the self-deceived inventor.)



1.9 Linking Fast Trains to the Telephone Network 

Special speech communication needs sometimes require unorthodox solutions. 
A case in point is the first public telephone link (in the 1960s) between 
a high-speed passenger train, running between New York and Washington, 
and fixed “ground” stations. Maintaining a constant telephone connection 
with a fast-running train requires a lot of signalling and switching between 
different “ground” stations along the track. In the original system there was 
no frequency space available for a separate signalling channel; the signalling 
had to be done “in band”, simultaneously with the ongoing conversation. 
But how can such fast, inband signalling be done without interfering with 
the speech signal?

Here an insight into the psychophysics of speech perception, originally
gained with vocoders and particularly formant vocoders, came to the rescue: 
while the human ear is quite sensitive to the accurate center frequency lo- 
cations of the formants (“resonances”) of a vowel sound, it is surprisingly 
tolerant to formant bandwidth. Formant bandwidths may be increased by a 
factor three or more, from 100 Hz to 300 Hz, say, without such a drastic spec- 
tral change making much of an audible difference. This auditory tolerance to 
increased bandwidth implies that the ear cannot accurately “measure” the 
decay rate of formant waveforms. Speaking in the time domain (as people 
usually do), the time constant for the decay of a formant can be decreased 
by a factor three or more, from 10 ms to 3 ms, say, without undue subjective
distortion. 



Fig. 1.5. Top: voiced speech signal. Bottom: the same speech signal with a reduced
envelope. Amazingly, the reduced-envelope signal, although highly distorted, retains
a good degree of naturalness. The time gaps created by this kind of center clipping
can be utilized to transmit other information, such as signaling information in a
telephone system for high-speed trains.



Another way of speeding up the decay of a formant in a speech signal is 
to modify its envelope by subtracting an adjustable bias from it, setting the 
diminished envelope equal to zero if it falls below the bias. (This operation is 
equivalent to center clipping of the envelope, rather than the signal itself.) The
result of this envelope modification is a pitch-synchronous gating in which
the low-amplitude parts of each pitch period are softly set to zero, thereby
creating time gaps in the signal, see Fig. 1.5. As long as these time gaps do 
not exceed about one third of the pitch period, the gating, if done properly, 
is hardly audible. The resulting “silent” time slots can be used to transmit 
other information, such as switching signals [1.11]. 
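
The envelope operation just described can be sketched numerically. The following is only a minimal illustration under stated assumptions, not the circuit used in the train telephone system: the envelope is estimated here as the magnitude of the analytic signal, and the function names, the synthetic test signal, and the bias value are hypothetical choices made for the example.

```python
import numpy as np

def analytic_envelope(x):
    """Envelope estimate: magnitude of the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC bin
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

def center_clip_envelope(x, bias):
    """Subtract `bias` from the envelope, set negative results to zero,
    and impose the diminished envelope on the signal."""
    env = analytic_envelope(x)
    reduced = np.maximum(env - bias, 0.0)
    gain = np.divide(reduced, env, out=np.zeros_like(env), where=env > 0)
    return x * gain

# Synthetic "voiced" test signal (an assumption for illustration): a 500-Hz
# formant re-excited every 10 ms (100-Hz pitch), decaying with a 3-ms time constant.
fs = 8000
period = 80                               # samples per pitch period
t = np.arange(period) / fs
one_period = np.exp(-t / 0.003) * np.sin(2 * np.pi * 500 * t)
speech = np.tile(one_period, 10)

bias = 0.2 * np.max(analytic_envelope(speech))
gated = center_clip_envelope(speech, bias)
gap_fraction = np.mean(gated == 0.0)      # share of samples set exactly to zero
print(f"fraction of samples gated to zero: {gap_fraction:.2f}")
```

Because the gain falls smoothly to zero as the envelope approaches the bias, the gaps open "softly" rather than with an abrupt switch. Raising the bias widens the silent slots available for signalling.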



1.10 Digital Decapitation 

Which is more characteristic of an individual speaker: the shape of his or her 
vocal tract and the motions of the articulators; or the characteristics of the 
vocal source? Certainly, the main difference between male and female speech
is the fundamental frequency, that of males being typically lower
than that of females. But within a given gender, is pitch still more important
than the vocal tract and articulatory motions? 

There is some circumstantial evidence favoring both alternatives. The 
importance of pitch patterns is emphasized by the following observation. I 
see a group of people talking to each other on a noisy street corner in New 
York City but I do not understand a word; there is too much traffic noise. But 
even through the din of traffic, I can hear the pitch patterns - and they are 
clearly Italian. (Later, on approaching the group, I discover they are actually 
speaking English with an Italian accent.) 

Evidence for the importance of articulatory motions again comes from 
foreign and regional accents. Articulatory motion patterns are “frozen in” at 
an early age (around 10 years) and are hard if not impossible to break in 
later life. Apart from pitch patterns, this is how we recognize the voice of a 
friend or guess the place of birth and early life of an unknown speaker.^

To investigate the relative importance of pitch patterns and articulatory 
dynamics in characterizing an individual voice, John Pierce and Joan E. 
Miller separated the two effects by “cutting off people’s heads,” so to speak,
and switching them around. All this was of course done on the computer 
[1.12], by means of digital simulation; no laws - local or otherwise - were 
violated. Miller and Pierce digitized speech signals from several speakers and, 
on a digital computer, made a pitch-synchronous spectral analysis of the
utterances. The required pitch detection was done by hand by inspecting 
waveform printouts. The pitch patterns from one speaker were then combined 
with the results of the spectral analyses of the other speaker. The synthetic 
speech signal therefore contained elements of two different speakers: the pitch 
from Ronald, say, and the vocal tract (including the lips) from George. 

Naive (phonetically untrained) subjects then compared these synthetic 
voices with the natural ones, and for each synthetic signal had to make an 
identification with a natural speaker. To the surprise of some bystanders, the 
majority of the identifications were made on the basis of the vocal tract - 
not the pitch. 

This interesting pilot study did not systematically explore all possible 
parameters that characterize an individual voice, and there is room for con- 
siderable future research here. 



^Of course, articulatory motions are not the only movements that characterize 
an individual. One can often recognize an acquaintance from behind by his gait 
and a skier from the style of his turns. Skiing, dancing (and the playing of musical 
instruments) should be learned at an early age - just like walking. In fact, walking 
without stumbling or falling down, seemingly so simple, is such a complicated task 
that only persons with the highest degree of neural plasticity, namely babies, can 
learn it efficiently.



1.11 Man into Woman and Back

The “head switching” experiment just described also bears on an exercise in 
speech synthesis in which the gender of the speaker is changed at will. In a first 
attempt to change a male into a female voice by digital manipulation, Bishnu 
Atal and I simply raised the pitch by one octave. The resulting “female” 
sounded suspiciously like a high-pitched male, confirming the age-old adage 
that there is more to the inequality between the sexes than pitch of voice. 

In a second attempt, we increased the formant frequencies by a constant 
factor to account for the shorter vocal tract of many females. But even this 
proved insufficient for a bona-fide female. We finally changed the individual 
formant frequencies independently with an algorithm that made allowance for 
other gender dissimilarities such as thinner cheek walls. The resulting syn- 
thetic voice was judged as clearly female by all listeners not familiar with the 
experiment. Yet, to the experimenters themselves, there remained a vestige 
of maleness, a soupçon of something deceptive.

Recently, in addition to transmuting a male voice into a female one, fusion
of male and female into a single voice has been achieved - with pretty artistic 
consequences: for the sound track of the 1995 film Farinelli-Il Castrato the 
voice of the countertenor Derek Lee Ragin was “married” by computer to 
that of the soprano Ewa Mallas-Godlewska - best men: P. Depalle, G. Garcia,
X. Rodet of IRCAM, Paris. 

1.12 Reading Aids for the Blind 

One of the earliest motivations for synthesizing speech from printed text 
was reading aids for the blind. Optical character recognition of fixed (or
even mixed) type fonts has been possible for some time. Optical scanners 
are becoming ever more popular. But the conversion of graphemes (printed 
letters) to phonemes (spoken sounds) and natural sounding sentences is still 
difficult. There are problems at the lexical, syntactic, and semantic levels: 
The pronunciation of a word or phrase often depends on its prominence, its 
grammatical function, and its intended meaning. The selection of the proper 
prosody (intonation, segment durations, and stress patterns) is likewise de- 
pendent on syntax and semantics. Thus, the purely algorithmic synthesis of 
natural sounding speech from written text (without human “fine tuning”) 
remains an active field of linguistic research. 



1.13 High-Speed Recorded Books 

Another effort to help the blind in gaining access to printed texts is the 
recorded book program of the American Foundation for the Blind (AFB). 







The AFB and the Library of Congress have long had programs in which 
trained speakers (radio announcers and actors) record books on magnetic 
tape so that the blind can listen to the books they can’t read. But there 
is a remaining handicap: reading aloud takes twice as long or longer than 
silent reading. To close this gap, as already mentioned, Bell Laboratories has 
made available to the AFB a technique that speeds up spoken utterances by 
a factor two without loss in speech quality [1.9]. 



1.14 Spectral Compression for the Hard-of-Hearing 

Speech analysis and resynthesis allow a wide range of spectral modifications
in addition to the time compression and expansion just mentioned. High- 
frequency components that are difficult to hear, especially by older people, 
can be transposed in frequency to lower frequencies where they can be better 
heard. But learning to understand such recorded speech is difficult and may 
require special training [1.13]. 



1.15 Restoration of Helium Speech 

Spectral compression is also required for the restoration of the “Donald 
Duck”-like speech of divers who, to avoid the “bends,” breathe a mixture
of oxygen and helium (rather than the nitrogen in ambient air). Since the 
velocity of sound in helium (being a lighter gas) is considerably greater than 
in nitrogen, the resonances of the vocal tract are shifted upward in frequency 
resulting in a strange and often unintelligible “quack-quack.” Spectral com- 
pression puts the resonances back in their proper place [1.14]. 



1.16 Noise Suppression 

A major aim of speech processing is the suppression of unwanted noises. 
Many noises have a broadband continuous spectrum, as opposed to voiced 
speech signals which have their energy concentrated near the harmonics of the 
fundamental frequency (pitch). Thus, a comb filter, tracking the fundamental
frequency and its harmonics, can reject much of a continuous-spectrum noise. 
Improvements in signal-to-noise ratio of 20 dB and more were realized in this 
manner by the author as early as 1956. However, the remaining noise has the 
same harmonic structure (and pitch-like sound) as the speech signal itself 
so that the human auditory processor can no longer exploit the spectral 
distinction between voiced speech and noise. 
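
A rough numerical sketch of the comb-filter idea follows. It is not the 1956 apparatus: it assumes a constant, exactly known pitch period (a real filter would have to track the fundamental), white noise, and a simple comb that averages the signal over several past pitch periods. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def comb_filter(x, period, n_periods):
    """Average x with copies of itself delayed by 1 .. n_periods-1 pitch periods.
    Components that repeat every `period` samples pass unchanged; others are attenuated."""
    y = np.zeros(len(x))
    for k in range(n_periods):
        delay = k * period
        y[delay:] += x[:len(x) - delay]
    return y / n_periods

def power_db(x):
    return 10 * np.log10(np.mean(x ** 2))

fs = 8000
period = 80                       # 100-Hz fundamental, assumed constant
n = np.arange(2 * fs)
voiced = sum(np.sin(2 * np.pi * 100 * h * n / fs) / h for h in range(1, 6))
noise = np.random.default_rng(0).standard_normal(len(n))

n_periods = 16
skip = (n_periods - 1) * period   # discard the start-up transient
# The filter is linear, so speech and noise can be filtered separately to measure each.
snr_in = power_db(voiced[skip:]) - power_db(noise[skip:])
snr_out = (power_db(comb_filter(voiced, period, n_periods)[skip:])
           - power_db(comb_filter(noise, period, n_periods)[skip:]))
snr_gain_db = snr_out - snr_in
print(f"SNR improvement: {snr_gain_db:.1f} dB")
```

Under these idealized assumptions, averaging over K periods lowers the white-noise power by a factor of K, i.e. by 10 log10(K) dB. The noise that survives repeats with the pitch period, which is the pitch-like residue mentioned above.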

Fig. 1.6. Noise reduction by spectral analysis of the signal and subtraction of the
estimated noise level in each of 10 to 20 frequency channels. The noise spectrum is
estimated during silent intervals in the speech signal. For voiced speech signals a
reliable pitch extractor is needed.

Another approach to noise suppression relies on spectral analysis and
synthesis: an estimate of the noise spectrum (obtained during silent speech
intervals) is subtracted from the noisy speech spectrum, see Fig. 1.6. The
resulting “cleansed” spectrum is used in the resynthesis of a noise-free speech
signal. However, since the noise spectrum is never exactly known, the
resulting speech suffers from spectral distortion [1.15].
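
The subtraction step can be illustrated with a short sketch. It deliberately swaps in a different analysis from the one in Fig. 1.6: instead of 10 to 20 channels with pitch-excited resynthesis, it uses a short-time Fourier transform and reuses the phase of the noisy signal. The frame length, the test tone standing in for speech, and all function names are assumptions made for the example.

```python
import numpy as np

FRAME, HOP = 256, 128
WINDOW = np.hanning(FRAME + 1)[:-1]        # periodic Hann: half-overlapped windows sum to 1

def stft(x):
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    frames = np.fft.irfft(spec, n=FRAME, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + FRAME)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += frame    # overlap-add
    return out

def spectral_subtraction(noisy, n_silent_frames):
    """Estimate the noise magnitude from the leading silent frames and
    subtract it from every frame, clamping negative magnitudes to zero."""
    spec = stft(noisy)
    magnitude, phase = np.abs(spec), np.angle(spec)
    noise_estimate = magnitude[:n_silent_frames].mean(axis=0)
    cleaned = np.maximum(magnitude - noise_estimate, 0.0)
    return istft(cleaned * np.exp(1j * phase))

fs = 8000
rng = np.random.default_rng(1)
silence = np.zeros(2000)                                   # "silent interval"
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / fs)      # stand-in for speech
clean = np.concatenate([silence, tone])
noisy = clean + 0.3 * rng.standard_normal(len(clean))

restored = spectral_subtraction(noisy, n_silent_frames=14)

segment = slice(2256, 9700)                                # interior of the tone
err_before = np.mean((noisy[segment] - clean[segment]) ** 2)
err_after = np.mean((restored[segment] - clean[segment]) ** 2)
improvement_db = 10 * np.log10(err_before / err_after)
print(f"error reduced by {improvement_db:.1f} dB")
```

The residual error after subtraction reflects the point made above: the noise spectrum in any given frame differs from its long-term estimate, so some noise survives and some of the signal is distorted.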

The application of this resynthesis method requires the accurate extrac- 
tion of the fundamental frequency from the noisy speech signal. Luckily, the 
cepstrum method, originally developed to distinguish underground nuclear
explosions from earthquakes,^ can cope with speech signals with signal-to- 
noise ratios near 0 dB. 



^Underground explosions, such as those set off by nuclear weapons tests, gener- 
ate compressional waves which travel faster than the shear waves emanating from 
earthquakes. Thus, the travel times (delay patterns) with which these waves arrive 
at the seismic recording stations contain important clues to distinguish nuclear from 
natural events. To measure these delay differences accurately, J. W. Tukey and B. 
P. Bogert invented the “cepstrum” (a neologism coined by Tukey). The cepstrum 
is defined as the Fourier transform of the logarithm of the power spectrum of a 
signal. (Without the logarithm, one would obtain the autocorrelation function with 
its known difficulties of disentangling delays and resonances.) 

When I first heard about the cepstrum, I realized its potential for accurately mea- 
suring the delay pattern of the vocal cord motions, i.e. the fundamental frequency 
of speech, without interference from the resonances of the vocal tract. 
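
The cepstral definition given in the footnote can be tried out on a synthetic signal. The following is a minimal sketch under stated assumptions (a noise-free, perfectly periodic test signal with one formant, a Hann-windowed frame, and a 60 to 400 Hz search range); the function names are hypothetical.

```python
import numpy as np

def cepstrum(frame):
    """Inverse FFT of the log power spectrum. For this real, even spectrum the
    forward and inverse transforms differ only by a scale factor."""
    spectrum = np.fft.fft(frame * np.hanning(len(frame)))
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    return np.fft.ifft(log_power).real

def pitch_from_cepstrum(frame, fs, f_min=60.0, f_max=400.0):
    """Locate the strongest cepstral peak in the delay range of plausible pitch periods."""
    c = cepstrum(frame)
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)
    peak = q_lo + np.argmax(c[q_lo:q_hi + 1])
    return fs / peak

# Synthetic voiced frame: a 500-Hz resonance re-excited every 10 ms (100-Hz pitch).
fs = 8000
t = np.arange(80) / fs
one_period = np.exp(-t / 0.003) * np.sin(2 * np.pi * 500 * t)
frame = np.tile(one_period, 13)[:1024]

pitch_hz = pitch_from_cepstrum(frame, fs)
print(f"estimated fundamental frequency: {pitch_hz:.1f} Hz")
```

The logarithm turns the product of excitation spectrum and vocal-tract response into a sum, so the slowly varying resonance contribution stays at short delays while the pitch period appears as a separate peak at longer delays.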




1.17 Slow Speed for Better Comprehension

A device similar to that for speeding up speech (see Sect. 1.7) can slow it 
down by a factor of two or more. The resulting slow speech is helpful in foreign 
language acquisition (think of a fast foreign speaker), for communicating with 
the mentally retarded and for speech therapy for aphasics.



1.18 Multiband Hearing Aids 
and Binaural Speech Processors 

For a “flat” (frequency-independent) hearing loss, simple amplification of 
all speech frequencies would compensate for the auditory damage. But most 
hearing impairments, be they inner-ear (sensorineural) in origin or poor sound 
conduction in the middle ear, show an increasing loss with increasing fre- 
quency. Thus, high frequencies have to be amplified more than low frequen- 
cies. The limited amplitude range (“recruitment”) that usually accompanies 
hearing loss requires amplitude compression to avoid overloading the ear. 
This fitting of the speech spectrum into the available perceptual “window” 
in the amplitude-frequency plane requires multi-channel hearing aids with 
independent amplitude compression in each frequency channel. 

Further improvements for listeners with residual hearing in both ears can 
be expected from binaural hearing aids that allow the listener to suppress 
noise from directions not coincident with the direction of the desired speech 
source. 

Even greater enhancements of speech understanding may be possible with 
binaural signal processors (“cocktail-party processors”) that suppress un- 
wanted noises by sophisticated algorithms implemented on wearable inte- 
grated circuits (“chips”) [1.16]. 

Yet, with all the advances - real or impending - in the art of hearing 
aids, there remains one major obstacle to achieving good hearing for
hearing-impaired people: upward spread of masking. In a quiet environment, even a
high-frequency hearing loss of 60 dB is not fatal for speech understanding. 
But in the presence of intense low-frequency disturbing sounds, the middle 
and high frequencies, which are crucial for proper speech perception, are 
totally masked, i.e. made inaudible.

Unfortunately, masking engendered by low-frequency noise prevails in 
many places, not least in packed restaurants that lack sufficient sound ab- 
sorption (carpets, curtains, and acoustic tiles). Such eateries could be made 
quieter by smaller tables (or larger distances between tables) thereby improv- 
ing the “signal-to-noise” ratio (i.e. the loudness of the voices of people at the 
same table relative to those at surrounding tables).



1.19 Improving Public Address Systems

Another evil for the hard-of-hearing are public-address systems whose low- 
frequency (“bass”) amplification has not been sufficiently turned down
compared to the high-frequency (“treble”) setting. Since reverberation times in
churches and lecture halls are generally longer at low frequencies, any low fre- 
quencies radiated into such enclosures play havoc with speech intelligibility 
because of reverberation in combination with upward spread of masking. This 
is particularly true for strong vowels (such as /a/) preceding weak consonants 
(like /f/ or /s/). 

To counteract this masking effect of reverberation, W. Meyer-Eppler has 
suggested a speech-activated switch that “kicks in” extra gain for weak high- 
frequency sounds. Unfortunately, the switching did more harm than good to 
speech intelligibility. (But this was in the 1950s using crude analog circuits. 
Perhaps a modern “soft” switch based on digital processing would do better.) 

The stable gain of public address systems can also be raised by frequency 
shifting; see the following section. 



1.20 Raising Intelligibility in Reverberant Spaces 

Of all room-acoustical parameters, reverberation time is the most important.
It is defined as the time interval during which the sound energy in an en- 
closure (without new sound input) decays by a factor of one million, i.e. by 
60 dB. Good concert halls have reverberation times around two seconds, and 
somewhat longer for frequencies below 250 Hz. 

It used to be said that a reverberation time of one second was ideal for 
speech intelligibility in a lecture hall. In reality, the ideal is no reverberation 
at all. In fact, any echoes with delays exceeding 50 ms, i.e. those not integrated 
with the primary sound by the human auditory system, are deleterious. Of 
course, reducing the sound absorption in a given hall will raise the sound 
intensity for a fixed input power level. But the benefits in intensity or loudness 
are more than offset by the ill effects of increased reverberation. 

However, sound intensity can be increased without adding reverberation 
by well-designed public-address systems that radiate only mid- and high- 
frequency sound aimed at the audience (by directional loudspeaker columns) 
where it is quickly absorbed by clothing and hair.^ 

^To focus the sound into the required horizontal-fan pattern, the loudspeaker
columns (“broadside” arrays in antenna lingo) have to be vertically positioned, with
a slight tilt toward the audience if they are located (as they should be) somewhat
above head level. Amusingly, in the Göttingen Stadthalle the loudspeaker columns
are oriented horizontally (to better hide them inside a large central chandelier). The
predictable result: good speech intelligibility only at the few seats in the directions
of the center normal of the three columns employed. - Apparently, people will go
to any length (and width) to be unintelligible!







To conceal the location of the electroacoustic sound sources and to en- 
hance the illusion that the sound emanates from the speaker’s lips, the sound 
from the loudspeakers should be delayed by about 10 ms. This exploits the
Haas effect which permits the sound intensity to be raised by some 10 dB
without affecting the perceived direction of the sound. The first large-scale
Haas-effect public-address system was installed in London’s St. Paul’s
Cathedral. It employed multiple speaker columns with increasing delays. Its
resounding success led to worldwide clones.

To reduce the incidence of acoustic feedback (“sing around”), I suggested
frequency-shifting the speech signal by about 5 Hertz. This puts the energy 
at the peaks of the transmission response (where the feedback leads to insta- 
bility) into the valleys of the response (where the excess energy is rendered 
harmless) [1.17]. 

The first field application of frequency-shifting took place during the 
1963 AT&T shareowners’ meeting in Chicago, attended by 23 000 eager
investors. The frequency shifting (realized by a simple single-sideband
modulation method) allowed an extra gain of about 6 dB. But the most important
advantage, according to Bell Laboratories’ sound engineers who operated the 
system, was the “soft-failure” behavior of the system with frequency shifting: 
instead of going from stable to unstable (“screeching”) within a fraction of 
a decibel of extra gain, the frequency-shifted system offered a 4-dB range 
(beyond the 6-dB extra gain) over which the system became progressively 
unstable (signalled by a 5-Hz pulsating modulation). 
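
A digital counterpart of the single-sideband frequency shift can be sketched as follows. This is a modern illustration, not the 1963 hardware: it assumes the analytic signal is obtained with an FFT-based Hilbert transform, and the function names and the 1000-Hz test tone are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Complex signal containing only the positive-frequency half of x."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def frequency_shift(x, shift_hz, fs):
    """Move every spectral component of x up by the same number of hertz."""
    t = np.arange(len(x)) / fs
    return np.real(analytic_signal(x) * np.exp(2j * np.pi * shift_hz * t))

fs = 8000
t = np.arange(fs) / fs                       # one second, i.e. 1-Hz resolution
tone = np.cos(2 * np.pi * 1000 * t)
shifted = frequency_shift(tone, 5.0, fs)
peak_hz = np.argmax(np.abs(np.fft.rfft(shifted))) * fs / len(shifted)
print(f"spectral peak after shifting: {peak_hz:.0f} Hz")
```

Each pass around the feedback loop moves the sound by another 5 Hz, so energy that starts on a peak of the room response does not stay there.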

With the advent of integrated circuits and the promise of elaborate signal 
processing, sophisticated dereverberation schemes have been suggested and
tested [1.18]. These go far beyond the simple enhancement of weak unvoiced 
sounds as originally proposed. 



1.21 Conclusion 

Analysis and synthesis of speech signals, made possible by a better under- 
standing of human speech production and perception - and implemented with 
the aid of ever more powerful digital signal processing devices - have enhanced
human speech communication and enlarged its reach. In the remainder of the 
book some of these themes will be taken up again in greater depth. 




2. A Brief History of Speech 



Descended from monkeys? My dear, let us hope that it is not true! 

But if it is true, let us hope that it not become widely known! 

The wife of the bishop of Worcester, 
hearing of Darwin’s Theory of Evolution



In the Beginning was the Word. 



St. John 1.1 



Nothing could be more succinct about the power of words than John’s opening 
phrase of his gospel. The word reigned supreme at the creation and - for 
better or worse - has never lost its potency up to the present. 

By comparison, the scientific study of speech has had but a very brief his- 
tory. The following is a casual chronicle of some of the highlights of this 
history, from von Kempelen’s speaking machines to neural networks and 
wavelets. 



2.1 Animal Talk 

Sound communication - from love calls to warning cries - plays a pivotal role 
in the animal kingdom and the survival of species. Although animals do not 
“speak” in the human sense of the word (except in cartoons and fairy tales), 
their vocabulary of whoops, yells and roars often shows a remarkable variety, 
finely tuned to their specific needs. True, many a mynah bird or parrot pro- 
duces acoustically respectable speech sounds, but it is doubtful - to say the 
least - that they know what they are saying, see however Fig. 2.1.^ Along
with the ability to produce meaningful sounds, some species, including
humans, developed the art of acoustic deception: think of the popular comedian
emulating the voice of a vaunted president; the resourceful hunter imitating
a bird call or stag roar; or certain cunning animals that ensnare a juicy prey
(or partner) by forged sound signatures.

Often the emphasis is on disguise rather than imitation: the wily
kidnapper making his extortionist call; the foreign spy camouflaging his true
identity; or the ravenous wolf softening his voice by ingesting chalk to better
masquerade as mother goat in a well-known fairy tale.

^I once sent an associate named Perry to a local pet shop to record the voice
of a mynah bird also called Perry. Their “conversation” started out on the worst
possible note when associate Perry thought his leg was being pulled when the
bird introduced himself with “My name is Perry, what’s yours?” Next the noisome
bird uttered “Merry Christmas,” whereupon the human Perry tried to convince his
feathered namesake that it was almost Easter. Before too long, their exchange
was marked by utter confusion!

Fig. 2.1. Parrots among themselves (when nobody seems to be listening) -
overheard by Leo Cullum of The New Yorker



2.2 Wolfgang Ritter von Kempelen

With this pervasive precedent in acoustic simulation (and dissimulation), it
is perhaps small wonder that man would one day attempt to fabricate human
speech by machine.

Fig. 2.2. Wolfgang Ritter von Kempelen’s 1769 speaking machine, as reconstructed
by Sir Charles Wheatstone, comprising a pressure chamber (the “lung”), a
vibrating reed (the “tongue”), and a leather tube (the “vocal tract”). Von Kempelen’s
contraption produced respectable vowel sounds but was widely disbelieved when
exhibited throughout Europe: in a singular piece of poor timing, the Hungarian count
had just demonstrated a “chess playing machine” that was exposed as fraudulent



While today talking computers are ever more widely heard (and perhaps 
abhorred), the first speaking “chip” was not etched in silicon but carved out
of wood. More than 200 years ago, in 1769, Wolfgang Ritter von Kempelen, a 
Hungarian nobleman, began work on a mechanical speaking machine that is 
reported to have produced respectable speech sounds [2.1]. In 1791, after 20 
years of hard labor, von Kempelen published a book in which he described 
his observations on human speech production and his experiments with his 
speaking machine [2.2]. The essential parts of the machine were a pressure 
chamber for the lungs, a vibrating metal reed to act as the vocal cords, and 
a pliable leather tube for the vocal tract, see Fig. 2.2. By manipulating the 
shape of the tube, von Kempelen could produce many different vowels. For 
brief presentations these artificial vowels reportedly sounded quite realistic. 
For the production of plosive speech sounds, von Kempelen employed a model 
of the vocal tract that included a hinged tongue and movable lips, see Fig. 
2.3. 

Fig. 2.3. Von Kempelen’s model for the production of plosive sounds, such
as b (as in bin) and d (as in din), characterized by a sudden pressure release
at the lips (b) or tongue tip (d)

While before von Kempelen the larynx was considered central to speech
production, his simple and successful demonstration drew the attention of
19th-century scientists to the vocal tract, the (nondecaying) cavity between
the glottis and the lips, as the main site of acoustic articulation.
Unfortunately for the ultimate fate of his speaking machine, the clever count
conceived and demonstrated a chess-playing machine at precisely the same time
that he worked on his talking contraption. But his widely exhibited chess
“automaton” was anything but automatic: it concealed a midget chess master
in its innards who communicated magnetically with the above-board chessmen.
While the shifty machine (actually the hidden midget) is said to have
beaten even Napoleon - no slouch at chess - the discovered deception cost von
Kempelen his hard-won credibility. People simply assumed that his speaking
machine, too, was just another fraud, merely conveying the speech sounds of
a hidden human.



2.3 From Kratzenstein to Helmholtz 

But von Kempelen’s travail was not for naught and his impact far from 
nil. His early forays into synthetic speech stimulated much research into the 
physiology of speech production and experimental phonetics. 

In 1779, the Imperial Academy of St. Petersburg (home to Euler for much 
of his prolific life) proffered its annual prize for explaining the physiological 
differences between the five long vowels: /a/ (as in part), /e/ (as in German 
Reh), /i/ (as in see), /o/ (as in German roh), and /u/ (as in fool); and
for producing these sounds artificially. The prize was won by the physiologist 
Christian Gottlieb Kratzenstein, born in Wernigerode, in the Harz mountains 
(not far from the now defunct iron curtain). His vowel resonators, see Fig. 2.4,
were “energized” by vibrating reeds, much as in musical instruments. 

Fig. 2.4. Von Kratzenstein’s resonators for the five German vowels that won him
the annual prize of the Russian Imperial Academy of Sciences at St. Petersburg in
1779

Of course, as we now know, the resonator shapes for a given sound are not
unique, a fact exploited by the ventriloquist when he keeps his lips fixed while
effecting (invisible) articulatory compensations with his tongue. This
articulatory ambiguity - different shapes having identical resonances - was already
pointed out by the British scientist W. Willis in the early 19th century.^

Willis clinched the connection between a specific vowel sound and the 
geometry of the vocal tract, as opposed to the shape of the articulators. He
was able to synthesize different vowels by means of tube resonators (“organ 
pipes”). In the process, he made the important discovery that vowel quality 
depended only on the length of the tube and not its diameter. 

The noted British physicist Sir Charles Wheatstone (1802-1875), inventor 
of the Wheatstone bridge, elaborated Willis’ theory, pointing out that the 
“cavity tone” of the tube (vocal tract) was excited by one of the partials 
(Fourier components) of the reed source. 



2.4 Helmholtz and Rayleigh 

It was on this “Fourier” concept that the great German physicist and physiol- 
ogist Hermann von Helmholtz (1821-1894) built his vowel theory, included in 
his monumental oeuvre on auditory perception [2.4]. According to Helmholtz, 
the vocal tract acts as an acoustic filter enhancing those harmonics of the 
exciting puffs of air emanating from the glottis that lie near one of the filter’s 
resonances - a crucial concept, still current today. The frequency regions 
around these resonant frequencies were - and still are - called formants. 
Helmholtz, preferring spherical resonators with narrow necks over tube res- 
onators, was able to synthesize the vowels /a/, /o/, and /u/ with a single 
formant. But he needed two formants for /æ/ (as in cat), /e/, and /i/. In
his theory, Helmholtz associated different coexisting formant resonances with 
different parts of his resonators, the different parts acting as coupled oscilla- 
tors. 



^Amusingly, a method was recently described for synthesizing different diph- 
thongs all having the same (flat) power spectrum. This feat was accomplished by 
exploiting certain properties of monaural phase sensitivity, or waveform
dependence, of human auditory perception [2.3].

In parallel with the work of Helmholtz, L. Hermann developed a “puff-
theory” of speech production according to which puffs of air excite the vocal 
tract to oscillations that decay over time. Just as Helmholtz’s viewpoint, 
stressing frequencies and resonances, was stimulated by his experimental tools 
- frequency-selective resonators - so was Hermann’s theory engendered by 
his apparatus: a phonograph with an Edison-cylinder that actually recorded 
these decaying oscillations and made them visible. 

While seemingly at odds, the two viewpoints of Helmholtz and Hermann 
represent the same physical reality, as emphasized by none other than the 
great Lord Rayleigh, the connection being mediated by a Fourier transfor- 
mation [2.5]. 
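
The Fourier connection between the two pictures can be checked numerically. The sketch below is only an illustration with assumed values (a single 500-Hz resonance and a 10-ms decay time constant, neither taken from Helmholtz or Hermann): a decaying oscillation in the time domain is transformed and examined as a resonance peak in the frequency domain.

```python
import numpy as np

fs = 8000
f0 = 500.0            # resonance ("formant") frequency in Hz
tau = 0.010           # decay time constant of the oscillation in seconds
t = np.arange(fs) / fs                        # 1 s, giving 1-Hz frequency resolution

# Hermann's time-domain picture: one puff of air excites a decaying oscillation.
decaying = np.exp(-t / tau) * np.sin(2 * np.pi * f0 * t)

# Helmholtz's frequency-domain picture: the same signal seen as a resonance peak.
power = np.abs(np.fft.rfft(decaying)) ** 2
freqs = np.fft.rfftfreq(len(decaying), d=1 / fs)
peak_hz = freqs[np.argmax(power)]
bandwidth_hz = np.sum(power >= power.max() / 2) * (freqs[1] - freqs[0])

print(f"peak at {peak_hz:.0f} Hz, half-power bandwidth about {bandwidth_hz:.0f} Hz")
print(f"1 / (pi * tau) = {1 / (np.pi * tau):.1f} Hz")
```

For a lightly damped oscillation the half-power bandwidth is close to 1/(π·tau), so a faster decay in Hermann's description is the same thing as a broader resonance in Helmholtz's.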

This parallelism between two apparently divergent theories is much the 
same as the equivalence between Heisenberg’s matrix formulation of quantum
mechanics and Schrödinger’s wave equation: one theory is essentially
the Fourier transform of the other. Heisenberg’s uncertainty principle is an 
immediate consequence of this Fourier relationship. 



2.5 The Bells:
Alexander Melville and Alexander Graham Bell

While speaking of Helmholtz and Rayleigh, we should not neglect to mention 
the work of two other giants: Alexander Graham Bell (1847-1922), and his 
elocutionist father Alexander Melville Bell [2.6]. During his childhood in Ed- 
inburgh, A. G. Bell, the future inventor of the telephone,^ had an opportunity 
to see and hear a reconstruction of von Kempelen’s machine by Wheatstone. 
Encouraged by his father, little Alexander proceeded to produce his own 
speaking machine, a replica of the human speech organs, complete with rub- 
ber lips and a wooden tongue [2.8]. 

Bell was stimulated also by Hermann von Helmholtz and especially by an 
article Helmholtz wrote around 1863 on the creation of “intelligent” sounds 
through the use of electrically driven tuning forks [2.9]. Bell’s limited grasp 
of German suggested to him that Helmholtz was talking about a “talking 
telegraph”. But, although Bell was later disabused, his intense interest in 
electrical transmission of speech did not abate one bit. 

In what must have been one of the drollest antecedents of modern
telecommunications, Bell even enlisted the help of his pet terrier [2.10]. He taught the
dog to sit up on his hind legs and emit a continuous growl. While the growling
was going on, Bell changed the shape of the dog’s vocal tract by squeezing it
(lightly) from the outside [2.11]. After mastering the vowels /a/ and /u/, the
diphthong /ou/, and the syllables /ma/ and /ga/, the manipulated dog rose
to new linguistic heights, growling complete sentences such as: “How are you
Grandmama?” This feat, according to Bell, may have led to the rumor that
he once taught a dog to speak. Needless to say, the unattended dog did not
utter a single word beyond the usual canine vocabulary.

^As early as 1860 Philipp Reis, a German professor, had constructed a premature
telephone, which however was not officially recognized as capable of transmitting
intelligible speech because it was basically an on-off switch activated by sound
waves. Of course, we now know that one-bit (“infinitely clipped”) speech can be
quite intelligible, but a hundred years ago Reis did not prevail with his binary speech
signal. To complete the irony, Reis’ device did produce multi-valued (“analog”)
signals at very low amplitudes because his switch in fact constituted a variable
resistance (called Wackelkontakt in German) [2.7].

However, I once did encounter a dog that sang. But then synthetic 
“singing” is simpler than speaking - in the sense that it is easier for some ani- 
mals to emit a succession of notes of different pitches than to produce different 
speech-like sounds. This, incidentally, is true also for electronic speaking ma- 
chines: a singing computer, although perhaps unintelligible, is much more 
impressive to lay audiences than a speaking computer with its unpleasant 
electronic accent. In fact, the New York Times once reported that a speech 
research establishment in Dresden (in what was then East Germany) was 
far ahead of the West because their computer could even sing. Well, John 
L. Kelly and Carol Lochbaum at Bell Laboratories had a singing computer
(intoning “Daisy, Daisy, give me your answer true. . . ”) way back in the early 
1960s [2.12]. 

2.6 Modern Times 

Helmholtz’s theory of vowel production gained further support from the work 
of G. Stumpf, O. G. Russel and Sir R. Paget [2.13]. Stumpf studied the spec- 
tral structure of vowels with an acoustic interferometer of his design - a Rube 
Goldberg contraption that occupied five rooms! With the same apparatus he 
was able to synthesize good-sounding vowels from the fundamental tones of 
28 flue pipes. In a refinement of Helmholtz’s finding, Stumpf showed that all
vowels have at least two formants [2.14]. 

Russel’s contributions, following the work of E. A. Meyer [2.15], were 
the excellent x-ray images of the articulators that finally refuted the faulty 
physiology of earlier investigators and laid the groundwork of modern x-ray
analysis in speech research [2.16-18]. 

Sir Paget introduced whispered sounds into speech analysis, thereby elim- 
inating the confusing interference between the harmonic frequency compo- 
nents of voiced speech sounds and their formant frequencies [2.19]. (The ul- 
timate solution of this stubborn “pitch” problem did not emerge until the 
arrival of the cepstrum in 1962 [2.20].) Paget also constructed pliable vocal 
tracts made of wood, rubber, and deformable plastic. 

In the 1930s two Japanese researchers, J. Obata and T. Teshima, and the 
German musicologist E. Thienhaus discovered the third formant in vowels 
[2.22,2.23]. Thienhaus employed a new approach to spectral analysis - a 
superheterodyne method - spawned by emerging radio receiver technology 
and invented by M. Grützmacher, known in German as Suchton-Analyse
[2.23]. Grützmacher’s method has become the method of choice for modern
non-realtime spectrum analyzers, as embodied in sound spectrographs used 
in making “voice prints,” see Chap. 1. 



2.7 The Vocal Tract 

Building on previous progress, the human vocal tract soon assumed a cen- 
tral role in speech research. Figure 2.5 shows a simplified cross-section of 
the vocal tract for the vowel sounds /ee/ and /oo/ and the corresponding 
(smoothed) frequency spectra of the resulting speech sounds showing the 
three first resonances of the tract. Following the nomenclature of musicology, 
these resonances are called formants [2.24]. It is these formants and their 
movements that the ear perceives and the brain combines when we listen to 
speech. 

Note the different tongue position for /ee/ and /oo/. All vowel sounds are 
nicely distinguished, both in production geometry and resulting spectrum. 
This is the basis of human vowel production and perception. Of course, the 
vocal tract has more resonances at higher frequencies, but they are not as 
important linguistically as the first three. 

The main articulatory organs are the tongue (body and tip positions), the 
soft palate or velum that guards the entrance to the nasal tract, and the lips 
(opening area and degree of rounding). Among the different speech sounds 
the vocal tract can produce, some pairs are distinguished by the change in a 
single articulatory parameter. Such pairs of speech sounds are called minimal 
pairs, such as /ee/ (as in bee) and the German umlaut /ü/. The German /ü/
(equivalent to the Dutch or French pronunciation of “u”) can be produced
from the articulation of the /ee/ sound by just rounding one’s lips while
keeping the tongue immobile. This can be nicely demonstrated by a simple
plexiglass model of the vocal tract, see Fig. 2.6. The teacher, by holding his
hand over the lip opening of the model (and thereby decreasing its opening
area), can change the /ee/ sound to a respectable /ü/.^

Fig. 2.5. Cross-sections through the human vocal tract for the vowels i
(/ee/) and u (/oo/) and the resulting frequency spectra. Note the different
tongue positions for these two speech sounds and the corresponding
differences in the resonance (formant) patterns

Fig. 2.6. A minimal mechanical model of the vocal tract consisting of a plexiglass
tube with square cross-section and a “tongue” mimicked by a plastic cube. A little
loudspeaker at one end emitting quasiperiodic pulse trains plays the role of the vocal
cords. By moving the plastic cube, different vowel sounds can be approximated.
Partial closing of the opening at the end of the plexiglass tube (the “lips”) changes
an /ee/ sound to the German umlaut /ü/. A steady, periodic pulse train as
excitation signal makes the artificial vowels sound like an electric door buzzer.
Only dynamically variable sounds are interpreted as speech. For human perception
to “perk up,” sufficient contrast in a temporal or spatial stimulus is required. [For
the eye looking at a steady scene, the required contrast is provided by involuntary
rapid (“saccadic”) eye movements.]



^Considering that /ee/ and /ü/ are a minimal pair, it may be difficult to
understand why many native English speakers have so much trouble with the French
pronunciation of “u,” as in “déjà vu,” which often comes out as “déjà vous.” When
alerted to their mistake (in a French study class), the typical answer is “that’s what
I just said: déjà vous.” In other words, these English-speaking students of French
apparently don’t even hear the difference between the French vowels ou and u -
much like some Japanese have difficulty in perceiving, let alone articulating, the
difference between r and l. Of course it is now well-known that linguistic distinctions
that are not “absorbed” at an early age lead to dysfunctions later on.



2.8 Articulatory Dynamics

How important the movements of the formants are as we speak is evidenced
by the fact that a stationary vowel spectrum does not even sound like a
speech sound. After having been held steady for several seconds (easy for a
computer, but not for a human speaker), some “vowels” sound more like a 
buzzer or some other inanimate squawk box. On the other hand, short ut- 
terances, like “ba,” “da” and “ga” are perceptually distinguished by initial 
formant movements. In fact, as A. Liberman, F. Cooper and other researchers 
at Haskins Laboratories have shown, one can clip the initial consonants (“b,” 
“d,” or “g”) from the speech waveform (in the pre-computer age this could be 
accomplished by applying scissors to magnetic tape recordings of the utter- 
ances) and still perceive the correct consonant if only enough of the formant 
movement was kept [2.25]. 

Thus, if we want to study speech signals from a linguistic point of view, 
we need a device that portrays the spectral dynamics of the vocal utterance. 
Such a device, as already mentioned, is the sound spectrograph, invented in
the Second World War to analyze the voices of enemy radio operators and 
thus track their movements behind the front (an important clue in predict- 
ing imminent military offensives). Unfortunately, although the sound spec- 
trogram contains all the important linguistic information of a speech signal, 
it has proved impossible to teach people to “read” running speech that way 
- which would have been a great boon for the deaf [2.26]. 

Figure 1.1 shows the spectrogram of an utterance lasting about three 
seconds. Time increases along the abscissa to the right and frequency goes 
up along the ordinate. Spectral intensity is shown as an increasing degree 
of blackness. When embellished by contour lines of equal spectral intensi- 
ties, such sound spectrograms vaguely resemble fingerprints, which earned 
them the nickname voiceprints. However, this is a (deliberately?) mislead- 
ing designation because, for forensic purposes, the usefulness of voiceprints 
is limited [2.27]. 

One focus of modern speech research has been the relationship between 
articulatory dynamics and the acoustic speech signal. Articulatory motions 
can be studied by x-rays as has been practiced by Fant [2.18] and Fujimura, 
who invented an x-ray microbeam method [2.28]. A continuing challenge has 
been the derivation of vocal-tract area functions from the speech signal itself. 
Of course, this approach is afflicted by problems of articulatory ambiguity. 
However, this ambiguity can be defeated by measuring the impedance at the 
lips. Figure 2.7 shows a lip impedance measuring tube with a loudspeaker on 
the left and two closely spaced microphones on the right. The microphone 
outputs are fed to a computer which separates the incident sound wave from 
the sound wave reflected at the lips and calculates the area function
from the lip impedance. Figure 2.8 compares the area function computed from
a measured impedance function with an actual (test) area function. 

The kind of ambiguity encountered here in speech research occurs in 
many contexts and is characteristic of “inverse problems”. Thus, the Swedish 
mathematician (and later Rector of the Royal Institute of Technology in 
Stockholm) Borg discovered some time ago that knowledge of the resonant
frequencies of a violin string is not sufficient to derive its mass density
distribution [2.29]. But knowing two independent sets of resonant frequencies (for
two different boundary conditions) would suffice (up to a spatial resolution
determined by the highest known frequency). When reading Borg’s paper,
it occurred to me that the frequency locations of the poles and zeros of the
input impedance of the vocal tract, as measured at the lips, are equivalent to
two such independent sets of resonant frequencies (for the open and closed
tract, respectively) [2.30].

Fig. 2.7. Experimental setup for measuring the input impedance function at the
lips. During the measurement the subject articulates (but does not phonate)
different speech sounds

Fig. 2.8. Area function, computed from a measured lip impedance function as
shown in Fig. 2.7, compared with the actual (test) area function. The achievable
spatial resolution is limited by the shortest measured wavelength (the highest
measured frequency) in accordance with the spatial sampling theorem

Another endeavor of current speech research is to investigate the func- 
tioning of the vocal cords and their interaction with the vocal tract [2.31]. 

So far I have stressed spectral properties of speech. But temporal aspects 
are just as important not only for prosody but even for vowel perception, as 
W. Endres has demonstrated. For example, the long vowel /ee/ (as in bee) 
can be changed to a short /i/ (as in bit) just by accelerating the rate of 
reproduction. 

The modulation transfer function measures the preservation of temporal 
features and is an important tool for predicting speech intelligibility [2.32, 
2.33]. 

2.9 The Vocoder and Some of Its Progeny 

In 1928, one hundred and fifty years after von Kempelen’s wooden speaking
machine, electronic speech coding was inaugurated by H. Dudley, an electri- 
cal engineer at Bell Telephone Laboratories. Dudley proposed to send speech 
signals over a new transatlantic telegraph cable with the (then) enormous 
bandwidth of 100 Hz. As mentioned in Chap. 1, he argued that speech was 
generated by slowly moving articulators and should therefore require only 
some 100 Hz total bandwidth for transmission. Since extraction of these pa- 
rameters proved difficult, Dudley suggested, as an alternative, the transmis- 
sion of the likewise slowly changing spectral information. The result was the 
frequency channel vocoder, which is based on the spectral decomposition of
sounds in the inner ear. Early vocoders had 10 to 16 channels. In 1956, the au- 
thor designed a nearly unintelligible 6-channel vocoder based on hearing-like 
lateral “inhibition” between adjacent channels. 

A somewhat more successful approach to reducing the number of channels 
in a vocoder (and therefore the amount of information to be transmitted) is 
based on the fact that adjacent frequency channels carry correlated informa- 
tion. Attempting to exploit these correlations Henry P. Kramer and Max V. 
Mathews asked themselves how best (in a minimum r.m.s. sense) to represent 
16 vocoder channels by, say, 8 signals. The answer is a linear transformation 
by a 16 × 8 matrix. Of course such a “contracting” matrix has no inverse.
However, there is an optimum 8 × 16 matrix, called a pseudoinverse, that
recreates an approximation to the original 16 channels [2.34]. When actually
implemented, it appeared that this 2-to-1 compression scheme didn’t perform
any better (in terms of speech quality and intelligibility) than just
suppressing every other channel. So the great invention was not further pursued. (But
it did teach the author, and perhaps other bystanders, a bit of linear algebra 
and pseudo-inverses.) - In retrospect, the Kramer-Mathews scheme might 
have worked better if, instead of minimizing r.m.s. spectral error, it had been 
based on a perceptual error criterion. 

The “electronic accent” of speech synthesis can be reduced by choosing 
proper phase angles, thereby lowering the peak factor of the excitation signal 
[2.35]. Surprisingly, by manipulating the phase angles of a flat-spectrum signal
intelligible speech can be created [2.3]. 

As already mentioned, the channel vocoder was first used to scramble 
speech signals in the secret telephone link between Churchill and Roosevelt. 
Renewed efforts after the war to develop a vocoder for universal use foundered 
on the obstinacy of the pitch problem - the difficulty of extracting accurate 
pitch periods from running telephone speech signals [2.36]. In an ingenious 
dodge, the pitch problem was circumvented by the voice-excited vocoder 
(VEV), which allowed the transmission of high-quality (10-kHz bandwidth) 
speech over ordinary telephone lines; see Sect. 2.12.

Another attempt at improving the quality of synthetic speech was the 
phase vocoder by J.L. Flanagan and Roger M. Golden [2.37]. 



2.10 Formant Vocoders 

Another method of reducing the number of signals to specify the spectrum 
of a speech sound focuses on the formant frequencies, the resonances of the 
vocal tract that show up as peaks in the spectrum. Early work on the difficult 
task of tracking the formant frequencies of running speech was described in 
James L. Flanagan’s thesis at MIT [2.38]. 

It so happened that the author’s first invention at Bell was a method of 
tracking formant frequencies by means of spectral moments in four subbands 
covering non-overlapping formant-frequency ranges. Since such moments can
be measured in the time domain, the instrumentation is quite simple. (For 
example the first spectral moment of a signal is proportional to the average 
absolute slope of the signal.) 
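The time-domain measurement of a spectral moment can be illustrated with a small sketch. This is an assumption-laden illustration rather than the author's instrument: the function name and test tone are hypothetical, and the normalization by the average absolute amplitude is added here so that the result does not depend on signal level. For a pure sinusoid of frequency f, the average absolute slope divided by the average absolute amplitude equals 2πf.

```python
import math


def mean_abs(values):
    """Average of the absolute values of a sequence."""
    return sum(abs(v) for v in values) / len(values)


def slope_frequency_estimate(x, fs):
    """Estimate a dominant frequency (Hz) from time-domain averages only.

    For a sinusoid, mean|dx/dt| / mean|x| = 2*pi*f, so this ratio behaves
    like a first spectral moment for a narrow-band signal.
    """
    slopes = [(x[k + 1] - x[k]) * fs for k in range(len(x) - 1)]
    return mean_abs(slopes) / (2 * math.pi * mean_abs(x))


FS = 48000.0  # sampling rate in Hz (assumed)
tone = [0.3 * math.sin(2 * math.pi * 500.0 * k / FS) for k in range(4800)]
louder_tone = [10.0 * v for v in tone]

estimate = slope_frequency_estimate(tone, FS)
louder_estimate = slope_frequency_estimate(louder_tone, FS)
print("estimated frequency in Hz:", round(estimate, 1))
```

In a formant tracker of the kind described, one such measurement would be made in each subband after bandpass filtering; the filtering itself is omitted here.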

Formant synthesis can be accomplished by a series synthesizer with three 
or more resonance circuits connected in series. Alternatively, the resonators 
can be connected in parallel. This allows greater flexibility because the ampli- 
tudes of the formants can be specified independently. However, as E. Weibel 
has emphasized, adjacent resonators must be connected with a sign reversal 
to avoid generating spectral zeros between the formants [2.39]. 

It is interesting to note that LPC all-pole synthesizers are implicitly also 
formant synthesizers (without the need to specify the formant frequencies!).



2.11 Correlation Vocoders



According to a well-known theorem by Wiener and Khinchin, the autocor- 
relation function of a signal is the Fourier transform of its power spectrum 
(Chap. 10). Thus, the linguistically important spectral information is fully 
represented by the autocorrelation function. The question is how to resynthe- 
size a speech signal from the autocorrelation function. Converting a section 
of the autocorrelation function into a time signal by scanning it with the 
proper periodicity would create a signal whose spectrum would be the square 
of the original spectrum. 

To undo this squaring, the author proposed taking square roots in three 
or four frequency channels [2.40]. The resulting autocorrelation vocoder was 
a limited success (to put it mildly), but it did produce something vaguely 
speech-like at its output. (It took the advent of linear predictive coding 
(LPC) several years later, to teach the proper way of going from autocorre- 
lation sequences to time signals, namely by matrix inversion.) 
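The Wiener-Khinchin relation, and the spectrum squaring that plagued the autocorrelation vocoder, can be verified on a short frame. The sketch below is illustrative only: it uses a circular (periodic) autocorrelation so that the identity holds exactly for the discrete Fourier transform, and the test frame is an arbitrary assumed signal. It shows that the Fourier transform of the autocorrelation equals the squared magnitude spectrum of the frame, so that playing the autocorrelation out as a time signal yields a spectrum that is the square of the original one.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform by direct summation."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_autocorrelation(x):
    """r[lag] = sum over t of x[t] * x[(t + lag) mod N]."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]


# Arbitrary test frame: two sinusoidal components (assumed example data).
frame = [math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.cos(2 * math.pi * 7 * t / 32)
         for t in range(32)]

power_spectrum = [abs(c) ** 2 for c in dft(frame)]
autocorr_spectrum = dft(circular_autocorrelation(frame))

# Largest deviation between the two sides of the Wiener-Khinchin relation.
max_error = max(abs(a - p) for a, p in zip(autocorr_spectrum, power_spectrum))
print("largest deviation:", max_error)
```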



2.12 The Voice-Excited Vocoder 

Frequency-channel vocoders were originally invented to compress the trans- 
mission bandwidth required by speech signals. In 1957 John R. Pierce asked 
me to apply the vocoder principle to improve the quality of speech signals sent 
over ordinary telephone lines by compressing a wide-band (“high-fidelity”)
speech signal into the 3-kHz telephone band. Given that vocoders had the 
bad habit of lowering speech quality, this seemed a tall order indeed. Even 
in the unlikely case that a “good” pitch detector could be designed - good 
enough to produce telephone-quality speech - to attain the intended superior 
speech quality via a vocoder (or any other speaking machine) seemed utterly 
beyond the state of the art in the 1950s. 

Nevertheless, I thought that the goal of a high-fidelity vocoder, covering
frequencies up to 10 kHz, could perhaps be realized by transmitting part
of the speech signal, a so-called baseband, uncoded. For a baseband covering
frequencies up to 2 kHz, the remaining 1 kHz bandwidth of a 3-kHz
telephone band could be used to transmit the vocoder channel information 
for the speech frequencies between 2 kHz and 10 kHz. With six channels of 
constant relative bandwidth, each channel would correspond to about two 
critical frequency bands of human hearing. Such partial vocoders are also 
called semi-vocoders. 
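The channel layout just described can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the six channels are taken to be contiguous with equal frequency ratios between 2 kHz and 10 kHz, and a critical band is taken to have a relative bandwidth of about 15% (the figure quoted in Sect. 2.17). The function name is hypothetical.

```python
def constant_relative_bandwidth_edges(f_low, f_high, channels):
    """Band edges of contiguous channels with equal frequency ratios."""
    ratio = (f_high / f_low) ** (1.0 / channels)
    edges = [f_low * ratio ** k for k in range(channels + 1)]
    return edges, ratio


# Coded part of the spectrum: 2 kHz to 10 kHz in six channels.
edges, ratio = constant_relative_bandwidth_edges(2000.0, 10000.0, 6)

# Two adjacent critical bands, assuming about 15% relative bandwidth each.
two_critical_bands = 1.15 ** 2

for low, high in zip(edges[:-1], edges[1:]):
    print(round(low), "Hz to", round(high), "Hz")
print("edge ratio per channel:", round(ratio, 3))
print("ratio spanned by two critical bands:", round(two_critical_bands, 3))
```

Under these assumptions each channel spans a frequency ratio of roughly 1.31, close to the ratio of about 1.32 covered by two adjacent critical bands.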

The pitch accuracy problem for a semi-vocoder is even more stringent than
for full-band vocoders due to the requirement that the pitch in the coded part 
of the spectrum (2 kHz to 10 kHz) should match exactly the natural pitch be- 
low 2 kHz to avoid talking with two “voices” at the same time. To “solve” this 
impossible pitch problem, I resolved to circumvent it. Specifically, I decided 
to use as the excitation signal for the frequency components above 2 kHz a 
heavily distorted version of the baseband available at the receiver. Center
clipping or cubing are well suited as nonlinearities. The spectral flatness of 
the excitation signal can be enhanced by a multi-channel equalizer. 

Vocoders using an excitation function derived from (a subband of) the 
speech signal without pitch detection were called voice-excited vocoders or
VEVs [2.41]. Voice excitation guarantees the complete coherence between the 
original speech (in the baseband) and the coded portion of the spectrum. The 
first voice-excited vocoder was called by critical listeners “the first vocoder 
that sounded natural, like a human speaking” without the “electronic accent” 
of earlier vocoders that relied on pitch detection.^ (One of the test utterances 
I used for the VEV was a short dialogue in a Rhenish dialect, replete with 
fricatives, concerning a child licking the frost figures off the window pane of 
a street car. Here it is in full: “Daaf dat dat dann? - Dat daaf dat! - Dat dat
dat daaf!!”) 

The success of the voice-excitation principle taught speech researchers an 
important lesson, namely that there is more to the excitation function than 
just a fundamental frequency and a dichotomous voiced/unvoiced distinction. 
The lengths of successive pitch periods do not follow a smooth course over 
time; there are small “random” fluctuations, and there is a continuous range 
of excitations between completely unvoiced and short-time periodic - not to 
mention voiced fricatives, such as /z/ as in buzz in which the turbulent energy 
pulsates in synchrony with the pitch leading to a modulated-noise excitation. 

As related before, further progress on the pitch problem was made possi- 
ble by the cepstrum method, originally invented to distinguish underground 
nuclear explosions from earthquakes. 

There are many applications of vocoders besides speech scrambling and 
bandwidth compression, applications such as noise suppression and the 
restoration of helium speech. Time domain processing has also been employed,
based on pitch-synchronous gating and frequency division originally
proposed by H. Seki. Frequency compression factors of 2 to 1, while preserving
high speech quality, were demonstrated by a method called analytic
rooting [2.42] and by harmonic compression [2.43]. Claims of compression
factors as high as 8 and, facetiously, 32 (by A. Solzhenitsyn in his sarcastic
The First Circle of Hell) were shown to be fallacious; they were based on
bandpass filters with insufficient out-of-band frequency rejection.



2.13 Center Clipping for Spectrum Flattening 

As we have seen, amplitude compression, even in its extreme form of infinite
clipping, does not destroy speech intelligibility. By contrast, center clipping,
defined as setting the central amplitude values of a signal (within a given
threshold range) equal to zero while leaving the signal amplitudes outside
this range unchanged, renders a speech signal virtually unintelligible for high
enough threshold settings, say 90% of a running peak value. Such severe
center clipping eliminates most of the oscillatory waveform characterizing the
formants. As a result, the spectrum of severely center-clipped speech lacks
the typical formant structure and has a roughly flat spectral envelope. Yet
the fine structure of the spectrum is preserved. Such spectral flatteners are
useful in voice-excited vocoders to generate a flat-spectrum excitation signal
coherent with the uncoded speech signal.

^Before voice-excited vocoders, people, jocularly, distinguished between two
kinds of vocoders: those that were intelligible but sounded inhuman (like channel
vocoders) and those that sounded human but were unintelligible (like early
formant vocoders).

Another, “softer” operation for spectrum flattening is cubing the speech 
signal. Since cubing (or distortion by any other homogeneous power law) is a 
scale free operation, it does not require the setting of a threshold, which can 
be tricky. 
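The three nonlinearities discussed in this section are simple enough to state in code. The following is a minimal sketch with hypothetical function names. As a simplifying assumption, the center-clipping threshold is a fraction of the peak of the whole frame rather than of a running peak value.

```python
def infinite_clip(x):
    """One-bit ("infinitely clipped") signal: only the sign is kept."""
    return [1.0 if v >= 0 else -1.0 for v in x]


def center_clip(x, fraction):
    """Zero every sample whose magnitude lies below fraction * frame peak."""
    threshold = fraction * max(abs(v) for v in x)
    return [v if abs(v) >= threshold else 0.0 for v in x]


def cube(x):
    """Homogeneous power-law distortion; no threshold is needed."""
    return [v ** 3 for v in x]


frame = [0.1, -0.5, 1.0, -0.95, 0.3]   # arbitrary example samples
clipped = center_clip(frame, 0.9)      # only the largest excursions survive
one_bit = infinite_clip(frame)

# Scale-free property of cubing: doubling the input only rescales the output.
doubled = cube([2.0 * v for v in frame])
rescaled = [8.0 * v for v in cube(frame)]
print(clipped, one_bit)
```

Center clipping changes character with the threshold setting, whereas the cubed signal keeps the same shape at every input level, which is why no threshold has to be tuned.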



2.14 Linear Prediction 

A new era in speech analysis/synthesis dawned in 1967 with linear predictive
coding (LPC), based on an all-pole model of speech signals [2.44]. LPC
is used in practically all of today’s most successful speech analysis/synthesis 
systems. With LPC, formant-like analysis was made possible without encoun- 
tering the difficulties of formant frequency measurements inherent in formant 
vocoders [2.45]. Another great boost was given to LPC by the method of par- 
tial correlations pioneered by Itakura [2.46]. 
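The step from autocorrelation values to predictor coefficients can be sketched with the Levinson-Durbin recursion. This is a standard way of solving the Toeplitz normal equations of linear prediction without forming an explicit matrix inverse. It is offered here as an illustration, not as the method of the cited papers. The recursion also yields one reflection coefficient per stage, and these are closely related to the partial correlations mentioned above (sign conventions vary in the literature). The function name and test data are assumptions.

```python
def levinson_durbin(r, order):
    """Solve sum_j a[j] * r[|i - j|] = r[i], i = 1..order, for the predictor a.

    r     : autocorrelation values r[0], r[1], ..., r[order]
    return: (predictor coefficients, reflection coefficients, residual energy)
    """
    a = [0.0] * (order + 1)   # a[0] is unused; a[1..order] are the predictor taps
    energy = r[0]
    reflections = []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / energy
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        energy *= (1.0 - k * k)
        reflections.append(k)
    return a[1:], reflections, energy


# Autocorrelation of a first-order process, r[k] = 0.9**k (assumed test data).
autocorrelation = [0.9 ** k for k in range(3)]
coeffs, reflections, residual_energy = levinson_durbin(autocorrelation, 2)
print(coeffs, reflections, residual_energy)
```

For this test sequence a second-order predictor gains nothing over a first-order one, so the second coefficient comes out as zero to within rounding.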



2.15 Subjective Error Criteria 

Minimizing the perceived quantizing noise (rather than the root-mean-square 
error) by introducing the masking properties of the human ear into the coding 
process was proposed by the author in the 1970s [2.47]. This permitted bit 
rates below 1 bit/sample for the prediction residual [2.48]. In combination 
with code-excited linear prediction (CELP), rates as low as 0.25 bits/sample 
were achieved while maintaining high speech quality and full intelligibility 
[2.49]. 

Based on these achievements, once thought unattainable, LPC has become 
the preferred method for speech synthesis in innumerable applications from 
talking toys to spoken message services. It is interesting to note that LPC is 
closely related to the Maximum Entropy Principle, first used in geophysics 
(for oil prospecting) and in astronomy (for image enhancement).



2.16 Neural Networks

Neural networks (NN) have been applied successfully in speech recognition
and speech synthesis from text [2.50]. In fact, T. Sejnowski succeeded in 
demonstrating the learning process of a NN talker, which went from baby-like 
babble to more mature speech as it improved its own diction by “backward 
propagation” of corrective instructions. The potential of neural networks for
speech coding, however, has not been sufficiently explored.

Neural networks have, however, been used to advantage in automatic 
speech recognition [2.51] as an alternative to hidden Markov models [2.52]. 



2.17 Wavelets 

Wavelets have a long history. As early as 1909 Alfred Haar, a student of David 
Hilbert at Göttingen, submitted a Ph.D. thesis in which he proposed binary-valued
wavelets, published a year later in the Mathematische Annalen, vol. 69
(“Zur Theorie der orthogonalen Funktionssysteme”, pp. 331-371). Another
binary transformation that has long been used successfully in signal analysis 
is based on Walsh-Hadamard matrices. 

In recent years, wavelets have become very popular in signal processing, 
including speech analysis/synthesis [2.53]. 

There are two broad classes of wavelets: those that “live” on a linear 
frequency scale and those whose frequencies scale logarithmically, called affine 
wavelets. Linear scale wavelets correspond, roughly, to the impulse responses 
of the individual filters of a “linear” filter bank, i.e. a bank of filters with 
contiguous constant-bandwidth filters. 

Affine wavelets, by contrast, correspond to an “octave” or “third-octave” 
filter bank in which each filter has a constant relative bandwidth. More precisely,
scaled wavelets, as they are also called, have waveforms that are derived
from a single prototype by repeatedly applying a scale factor
(“multiresolution wavelets”). For “octave wavelets” the scale factor is 2, but other
scale factors - and indeed variable scale factors - may be preferable for a 
good representation of speech signals and optimum subjective (hearing re- 
lated) error criteria. Scaling is one of the fundamental concepts of nature that 
governs many of its laws and designs from fractals to power laws [2.54]. 

Affine wavelets are particularly germane to speech signals and synthesis 
for two important reasons: 

1. They mimic the (nearly) constant-Q character of the resonances of the 
human vocal tract. 

2. Above about 800 Hz, the frequency analysis of the human ear is ap- 
proximately logarithmic, i.e. the analysis is in terms of constant relative 
bandwidth. For normal hearing the relative bandwidth is around 0.15 
corresponding to a constant Q of about 7.

Viewed as a logarithmic filter bank, the human auditory “filter bank” has
therefore a roughly constant scaling factor of 1.15 which corresponds approx- 
imately to the fifth root of 2. A set of wavelets covering the speech bandwidth 
250-4000 Hz would therefore have 39 members. Of these, the lower 7 chan- 
nels can be combined into 3 channels of approximately constant bandwidth 
of 120-160 Hz. This would be in accordance with the spectral analysis in the 
inner ear, which is on a nearly linear frequency scale below 800 Hz. 
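Some of the figures quoted above can be reproduced with elementary arithmetic. The sketch below makes two assumptions not spelled out in the text: channel edges are spaced by the constant factor 1.15, and the lowest edge sits at 250 Hz. It checks the Q value, the closeness of 1.15 to the fifth root of 2, and the width of the three merged low-frequency channels.

```python
relative_bandwidth = 0.15

# Q is the reciprocal of the relative bandwidth.
q_factor = 1.0 / relative_bandwidth

# Scale factor between adjacent channels, compared with the fifth root of 2.
scale_factor = 1.0 + relative_bandwidth
fifth_root_of_two = 2.0 ** (1.0 / 5.0)

# The lowest seven channels, starting at 250 Hz (assumed lowest edge).
lowest_edge = 250.0
edge_after_seven = lowest_edge * scale_factor ** 7

# Merging those seven channels into three equally wide channels.
merged_width = (edge_after_seven - lowest_edge) / 3.0

print("Q:", round(q_factor, 2))
print("scale factor:", scale_factor, "fifth root of 2:", round(fifth_root_of_two, 4))
print("upper edge of seventh channel in Hz:", round(edge_after_seven, 1))
print("width of each merged channel in Hz:", round(merged_width, 1))
```

Under these assumptions the merged channels come out within the 120-160 Hz range quoted in the text, and the seventh channel ends below 800 Hz.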

It is not clear at the time of writing whether wavelets will in fact become 
the wave of the future for speech signals. More research is needed to optimize 
the various options that wavelets offer (basic waveform, sampling grid in time 
and frequency, bit allocations, etc.). 



2.18 Conclusion 

Computer speech can look forward to a promising future, both in terms of 
challenging research and useful applications, to wit: 

- efficient storage and transmission 

- mobile communication 

- spoken message services 

- “real audio” on the World Wide Web 

- voice-controlled maneuvering of wheelchairs by the handicapped

- attitude control of space capsules by astronauts 

- Internet speech security 

- educational tools 

- foreign language acquisition 

- reading aids for the blind 

- reduction of noise and reverberation 

- talking cars, cameras, and toys 

Additional applications, agreeable or noisome, are surely waiting in the wings. 




3. Speech Recognition and Speaker Identification



Civilization advances by extending the number of important operations which 
we can perform without thinking. 

Alfred North Whitehead 



If anything can go wrong, it will. 



Murphy’s Law



In this chapter we discuss, in an informal manner, some of the successes and a 
few of the outstanding problems of automatic speech recognition (ASR) and 
speaker identification - for forensic, business and banking purposes. ASR 
can also help the hard-of-hearing by giving them printed text to read, and 
the wheelchair-bound by allowing them to control their vehicles by voice. 
Together with speech synthesis from text, human-machine dialogue systems 
offer attractive possibilities for all manner of information services. 

Speech recognition is basically a pattern matching process. The objective 
in pattern matching is to compare an unknown test pattern with a set of 
stored reference patterns (“templates”), established from the training data, 
and to provide a set of similarity scores between test and reference pat- 
terns. The development of statistical pattern matching techniques based on 
dynamic programming and on hidden Markov models (HMMs) represents 
a major breakthrough in automatic speech recognition. Because of their
statistical nature and their simple algorithmic structure for handling the
large variability in speech signals, these methods have found widespread use
in automatic speech recognition.
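The pattern matching idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-component "patterns", the labels and the use of a negative Euclidean distance as similarity score are hypothetical choices, not the features or scores of any actual recognizer.

```python
import math

def similarity_scores(test_pattern, templates):
    """Score the test pattern against every stored reference pattern.

    The score is the negative Euclidean distance, so a larger score
    means a closer match (an assumed, deliberately simple measure).
    """
    return {label: -math.dist(test_pattern, reference)
            for label, reference in templates.items()}

# Hypothetical templates, as they might result from training data.
templates = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
}
test_pattern = [0.8, 0.2, 0.35]

scores = similarity_scores(test_pattern, templates)
best = max(scores, key=scores.get)  # label of the best-matching template
```

A real system would compare sequences of spectra or cepstra rather than single short vectors, which is where time alignment and statistical models come in.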

Neural networks and Kohonen maps are likewise among the promising 
techniques employed. 

Another important ingredient of automatic speech recognition that has
come to the fore recently is the enhancement of the modulation spectrum.
It has been known for some time that the modulation frequencies of speech
(i.e. the frequencies at which the amplitude or envelope of a speech signal
fluctuates) peak at about 4 Hz at a normal rate of speaking. Thus a modified
speech signal in which the modulation frequencies around 4 Hz are enhanced is
more intelligible to a human listener in the presence of noise.^ And what
helps human listeners also helps computers to better understand speech from
noisy or reverberant environments.
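To make the notion of a modulation spectrum concrete, here is a small sketch that recovers the modulation frequency of an artificial signal. All numbers are assumptions chosen for convenience: a 200 Hz tone sampled at 1000 Hz, amplitude-modulated at 4 Hz, with the envelope estimated by rectifying and averaging over 20 ms blocks. Real speech would of course need a more careful envelope detector.

```python
import cmath
import math

FS = 1000         # sampling rate in Hz (assumed)
CARRIER_HZ = 200  # carrier tone standing in for the speech fine structure
MOD_HZ = 4        # modulation (envelope) frequency
BLOCK = 20        # envelope block length in samples (20 ms)

def am_tone(n_samples):
    """Amplitude-modulated tone: slow envelope times fast carrier."""
    return [(1 + 0.5 * math.cos(2 * math.pi * MOD_HZ * n / FS))
            * math.sin(2 * math.pi * CARRIER_HZ * n / FS)
            for n in range(n_samples)]

def envelope(x, block=BLOCK):
    """Crude envelope: mean absolute value over consecutive blocks."""
    return [sum(abs(v) for v in x[i:i + block]) / block
            for i in range(0, len(x) - block + 1, block)]

def modulation_spectrum(env, env_rate):
    """Magnitude of the DFT of the mean-removed envelope, as (Hz, magnitude)."""
    mean = sum(env) / len(env)
    centered = [e - mean for e in env]
    n = len(centered)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        spectrum.append((k * env_rate / n, abs(coeff)))
    return spectrum

env = envelope(am_tone(2 * FS))               # two seconds of signal
spectrum = modulation_spectrum(env, FS / BLOCK)
peak_hz = max(spectrum, key=lambda fm: fm[1])[0]
```

Enhancing the modulation spectrum would amount to boosting the envelope components near the peak before re-imposing the envelope on the signal; that step is not shown here.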



3.1 Speech Recognition 

The automatic recognition of spoken language and its transcription into read- 
able text has been a long-held dream. I wish I could dictate this book into an 
automatic speech recognizer rather than laboriously tapping it out with my 
two index fingers. Of course many people would miss the human typist bright- 
ening up the office as an intelligent working partner. But still, automatic 
speech recognition has many practical applications including, ultimately, the 
voice-typewriter [3.1]. 

Human communication by voice appears to be so simple that we tend 
to forget how variable a signal speech is. In fact, spoken utterances even of 
the same text are characterized by large differences that depend on context, 
speaking style, the speaker’s dialect, the acoustic environment, microphone 
characteristics, etc. Indeed, even identical texts spoken by the same speaker
can show sizable acoustic differences. Automatic methods of speech recogni- 
tion must be able to handle this large variability in a fault-free fashion. 

In addition to phonological and lexical information, good automatic 
speech recognizers also rely on the grammar of the language to be recognized 
and - as far as possible - on the semantics of the text, the potential meaning. 
How little grammar without meaning can accomplish is nicely demonstrated 
by Chomsky’s grammatically correct nonsense sentence “Colorless green ideas 
sleep furiously”. (But nonsense, albeit not as obvious, remains a favorite de- 
vice of the politician - and the professional double talker.) 

Not surprisingly, the success of automatic speech recognition depends 
on the language. Just think of the names of the Old Continent in Italian 
and English. In Italian, the pronunciation of Europa is allotted 4 clearly 
enunciated syllables: E-u-ro-pa. By contrast, the English Europe has at best 
just two syllables (the second of which does not even contain a full vowel). 

Automatic speech recognition also depends critically on the specific task.
Thus the recognition of a few words from a small vocabulary, spoken in
isolation, preferably by a “master’s voice,” has been within reach for decades:
witness the toy dog “Rex” of yore who wagged his tail in recognition (!)
when addressed as “Rex.” (Actually, the mechanical marvel responded in the
same manner to any loud enough sound or noise - so much for early speech
recognition.)

^ Indeed, such modulation enhancement is now routinely practiced in better
hearing aids. Interestingly, human sensitivity to modulation frequencies (in
amplitude-modulated tones, for example) also peaks around 4 Hz, suggesting
some degree of coevolution of human speech and hearing.

However, stuffed animals - stuffed with electronics - have become much
wiser recently and even outright dangerous. Thus Furby, the fluffy fantasy
toy, according to a report in the Washington Post, has forced the National
Security Agency in Fort Meade, Maryland, to issue a “Furby Alert”. It
appears that NSA employees had “smuggled” the high-tech cyberpets into the
supersecret agency, and officials are naturally worried that people would
take the Furbys home with them and the stuffed pets would start talking and
divulge all manner of highly sensitive stuff (International Herald Tribune,
January 14th, 1999).

Another precocious speech recognition system was Bell Laboratories’ Au- 
tomatic Digit Recognizer, dubbed “Audrey,” intended for voice dialing [3.2]. 
After a brief training session on a new voice, it would dial correctly most of 
the numbers much of the time - but rarely a complete seven-digit number 
correctly. In the meantime, fairly reliable voice-dialing (in a moving car, for
example, see Fig. 3.1) is accomplished by pronouncing pre-selected
distinctive words like “home” or “office” or “Anny,” instead of digit strings.

Speech sounds can be described in terms of their distinctive features, such 
as nasality and voicing. In theory, speech recognition could be based on first 
recognizing distinctive features and then inferring the sounds themselves from 
an appropriate dictionary. However, distinctive features in running speech are 
often difficult to pin down. A better way for automatic speech recognition is 
to consider speech signals as stochastic sequences and treat them by statis- 
tical pattern recognition techniques incorporating linguistic constraints. An 
example is Bayes’ rule, which gives a posteriori probabilities in terms of a 
priori probabilities: 

$$p(W|A) = p(A|W)\,p(W)/p(A) .$$

Here $p(W|A)$ is the a posteriori probability that a given word $W$ was spoken
given an acoustic input (sound wave) $A$. It depends on the conditional
probability $p(A|W)$ for an acoustic input $A$ given a word $W$, which can in
principle be measured. The factor $p(W)$ reflects the different a priori
probabilities of different words. When looking for the most likely word $W$
going with a given acoustic input $A$, the factor $p(A)$ becomes immaterial.
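The use of Bayes' rule for picking the most likely word can be sketched as follows. The word list and all probabilities are invented for illustration; in a real recognizer the likelihoods would come from acoustic models and the priors from a language model.

```python
def most_likely_word(likelihoods, priors):
    """Return the word W maximizing p(A|W) * p(W).

    p(A) is the same for every candidate word, so it can be dropped
    when only the best word is wanted.
    """
    return max(likelihoods, key=lambda w: likelihoods[w] * priors[w])

def posteriors(likelihoods, priors):
    """Full a posteriori probabilities p(W|A), including the factor 1/p(A)."""
    evidence = sum(likelihoods[w] * priors[w] for w in likelihoods)  # p(A)
    return {w: likelihoods[w] * priors[w] / evidence for w in likelihoods}

# Hypothetical values of p(A|W) for one acoustic input A, and priors p(W).
likelihoods = {"wreck": 0.30, "recognize": 0.25, "wreak": 0.35}
priors = {"wreck": 0.02, "recognize": 0.07, "wreak": 0.01}

best_word = most_likely_word(likelihoods, priors)
post = posteriors(likelihoods, priors)
```

In this made-up example the acoustically best candidate ("wreak") loses to a word with a much higher a priori probability, which is exactly the effect of the factor $p(W)$.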

Speech recognition algorithms can be used in either a “speaker-dependent”
or “speaker-independent” mode. As the size of the vocabulary grows and the
circle of talkers is widened, reliable speech recognition becomes more
difficult. If words are not pronounced in isolation but strung together into
fluent, conversational speech - if there are background noises, echoes and
reverberation - useful speech recognition soon strains present capabilities.
In isolated words, or speech where words are separated by distinct pauses, the
beginning and end of words are more clearly marked. In continuous speech, word
boundaries are blurred and automatic methods of segmenting continuous speech
into words must be devised - not a trivial task. Machine recognition of
continuous speech with a large vocabulary requires that syntactic and semantic
constraints be incorporated in the recognition process.

[Cartoon: “Alison, please phone Mr. Bradshaw and tell him I won’t be available
for lunch. Then call the Coast Guard and tell them I’m drowning near Long
Branch, New Jersey.”]

Fig. 3.1. Modern mobile communication: the proper priorities of last calls as seen
by J.B. Handelsman, The New Yorker

However, in spite of these difficulties, interest in speech recognition and 
speaker identification refuses to wither and research continues vigorously. Af- 
ter all, the automobile, originally conceived as an ersatz horse, runs tirelessly 
- although it still balks at leaping fences. And the airplane, too, inspired 
by the swift bird, has grown in size and importance without ever having 
mastered the fine art of alighting on a tree. 



3.2 Dialogue Systems 

Among the numerous applications for automatic speech recognition perhaps
the most (im)pertinent are task-oriented dialogue systems, such as travel
information services. These require of course (limited) speech synthesis as
well as recognition. One prominent example is the Philips Train Timetable
Information System, which has been running in Switzerland and the Netherlands
since 1996 [3.3]. Another, earlier, example is the AT&T Bell Laboratories
“Conversational Mode Speech Understanding System” by S. E. Levinson and
L. R. Rabiner [3.4]. Here is a brief dialogue with the Bell system (not the
long-defunct Bell System):

Customer: I want one first class seat on flight number 31 to Denver on 
Sunday. 

Machine: I am sorry the flight is not available on Sunday. 

Customer: I want to go on Saturday then. 

Machine: Flight number 31 leaves New York at 09:45 and arrives in Denver 
at 13:10. Etc. 

The machine can, reportedly, also cope with some non-sequiturs and non- 
sensical questions by the customer (without losing its temper). 



3.3 Speaker Identification 

Speech recognition’s sibling, automatic speaker identification, too, has many 
potential applications. I remember, as mentioned, a visit to my Murray Hill 
office, in 1965, by a delegation from the American Bankers Association who 
wanted to know the chances of replacing payment by paper check with
voice-actuated money transfer. The customer’s voice was to take the place of the
signature on the check. When I pointed out the unreliability of automatic 
speaker verification, they were not discouraged: North American banks were 
losing (I forget how many) millions of dollars every year owing to forged or 
illegible signatures - or no signatures at all. So a certain “false accept” rate 
was quite acceptable to the bankers. But 33 years later, in 1998 - in spite 
of sizable inroads by electronic money transfers - Americans still wrote 70 
billion paper checks. 

Speaker identification or verification could also be of crucial importance in 
allowing (or denying) access to restricted data or facilities. Think of confiden- 
tial medical reports or bank statements. A crazy colonel could conceivably 
start a war by pretending to be someone much higher up in the chain of 
command. In World War II, speaker identification (by visual inspection of 
spectrograms) was used to track the movements of German radio communi- 
cators, thereby allowing the Allies to anticipate forthcoming enemy forays. 
This was the first “field” application of the sound-spectrograph and “visible 
speech” [3.5]. 

Beyond verifying a given speaker, identifying his accent or dialect is some- 
times the goal. Again, I remember a visit, this time by a pair of “spooks” 
from Virginia. They were eager to learn whether it was possible to build a
machine that could identify the dialect of an unknown voice. They brought
a secret recording, taped in a bar in Rio de Janeiro, of a Russian-speaking 
voice and they wanted to know whether a machine could tell if it had an 
Odessa accent. (I’ll spare you my answer, which is “top secret” anyhow.) For 
another breakthrough, see Fig. 3.2. 




[Cartoon: “Another Breakthrough from A.T.&T.”]

Fig. 3.2. The latest advance in reliable communication, witnessed by
J.B. Handelsman, The New Yorker



3.4 Word Spotting 

To stay with the Cold War for a while, the Soviets, too, were not sitting on 
their hands or ears. In fact, they excelled at a speech recognition task called 
word spotting. As is well known, the new Soviet embassy in Washington had 
been erected in a strategic position, directly in the path of a major U.S. 
microwave highway, affording the Russian “diplomats” an easy opportunity 
to snoop on untold toll calls. Of course, you can’t listen to thousands of 
conversations. But you can design a speech recognizer that pricks up its ears 
at the occurrence of certain key words, such as “wheat” or “wheat price” 
and only then record the conversation. It is said that in this manner the 
Soviets were able to strike a very advantageous wheat deal when another
food shortage was threatening the motherland.

Stalin, it appears, had no small interest in speech research. As mentioned
before, he supported speech recognition and speaker identification originally
to be able to trap “traitors” on the telephone. The Generalissimo and his
surviving military minions were also interested in speech compression because
it facilitated digital speech scrambling. All this is expertly told by one of the
prisoner-scientists involved, Alexander Isayevich Solzhenitsyn, in his The First
Circle of Hell.

While visiting the Soviet Union in 1963, as a guest of the Committee 
for the Coordination of Scientific Research, I was able to get a fair sample 
of Soviet capabilities in speech research but, curiously, one of their main 
laboratories (Lab 26, I think it was called) was “closed for repairs” during my
visit.^



3.5 Pinpointing Disasters by Speaker Identification 

My first encounter with the usefulness of voice recognition was in 1956, when 
two airliners collided over the Grand Canyon. There was a last message, 
just before the crash, from one of the planes ending in the words “We are 
going in. . . ” After that: Silence. The Federal Aviation Authority surmised 
that the speaker had just seen the other plane and was crying out his fateful 
discovery. But who was the speaker? The answer would identify the position 
of the speaker in the cockpit and therefore the probable direction of the other 
plane. Careful analysis of the spectrogram of the unknown voice by L. G. 
Kersta revealed that it matched the characteristics of the flight engineer. This 
modest piece of information helped the Authority to reconstruct the course 
of the collision. Subsequently, the FAA issued orders aimed at forestalling 
future accidents of this type. 

Another tragedy that caught the world’s attention was the burning up
of three U.S. astronauts, Virgil Grissom, Edward White and Roger Chaffee,
on the ground during a training session on 27 January 1967. Again I was
at the receiving end of a horrible tape recording: the last words of a human
being engulfed in flames. The voice screamed, at a pitch exceeding 400 Hz,
“Fire! We’re burning up!!” Whoever said those words probably saw the fire
first, implying that it had started on his side. But whose voice was it? The
screaming had distorted it beyond human, let alone machine, identification.
But spectral analysis allowed us to identify the screamer and helped NASA to
take corrective action. (This included replacing the highly flammable
pure-oxygen breathing atmosphere by a safer mixture of oxygen and nitrogen.
The Russians had been able to do this much earlier in their space program
because their rockets were more powerful and could carry the required greater
payload.)

^ I was, however, able to meet the speech research people in another location
where I was asked to preface my scheduled talk with a brief description of Bell
Laboratories. I started out by saying “Bell Labs is the research and development
arm of the American Telephone and Telegraph Company. We are 14 000 people”
- which was translated as “u nikh 14 chlen” (they have 14 members). Whereupon
I cried in desperation (forgetting that I wasn’t supposed to understand Russian)
“nyet, 14 tysyacha (14 000) chlen.” The audience burst into laughter, but I felt
sorry for the interpreter, knowing from painful personal experience how difficult
translating can be.

Incidentally, after my talk several Soviet colleagues commented that “I was the
first American they could readily understand.” Well, speaking with a sound German
accent - and being very much aware of the difficulties of transcending language
barriers - I was not surprised.



3.6 Speaker Identification for Forensic Purposes 

With these successes in voice identification it is not surprising that linguists 
soon thought of enlisting spectral analysis for forensic purposes. The main 
interest was in identifying the voice of an extortionist or suspected criminal. 
Before long, sound spectrograms were christened “voice prints” by those eager 
to sell the new “art,” the implication being that they were as reliable as 
fingerprints. 

To keep the discussion on safe scientific ground, the Acoustical Society of 
America formed a committee of speech experts to look into these claims [3.6]. 
The main conclusions of the committee’s report emphasized that a suspect’s 
voice could sometimes be excluded with certainty on the basis of incompatible 
spectral data. In other words, the suspect, given his or her vocal apparatus,
could never have produced all the features of the given utterance.
Furthermore, a voice could sometimes be “identified” with some probability from
a limited pool of potential candidates. But all bets were off, the committee 
concluded, for the identification of a voice from an open ensemble of speak- 
ers. Voiceprints are just not as uniquely characteristic of a person’s identity 
as the genetic code (DNA) or fingerprints - notwithstanding the entry in 
the American Heritage Dictionary of the English Language (Third Edition, 
1992): 

voiceprint (vois′print′) noun. An electronically recorded graphic
representation of a person’s voice, in which the configuration for 
any given utterance is uniquely characteristic of the individual 
speaker. 

The Random House Dictionary of the English Language (Second Unabridged 
Edition, 1978) has a more appropriate definition; its entry under voiceprint 
reads: 



A graphic representation of a person’s voice, showing the component
frequencies as analyzed by a sound spectrograph.



3.7 Dynamic Programming

Automatic recognition systems contain, in their memories, reference patterns 
or templates of the words to be recognized. These templates may be a suc- 
cession of amplitude spectra (on a linear frequency scale or a more ear-like 
progression) or they may be cepstra or any other set of parameters that 
characterize speech sounds. These templates usually imply a fixed time scale. 

An utterance to be recognized has of course its own time scale that rarely 
coincides with that of a template. Thus, the problem of proper alignment 
in time between unknown utterance and template becomes paramount. In 
discrete time, at every time step, the question arises of whether the un- 
known utterance is lagging behind the template, just in step, or ahead of 
the template. Depending on the answer, the template sample is held in place 
or advanced by one or several steps. This process of “clock comparison” or
time registration, called dynamic time warping or dynamic programming, is
a crucial ingredient of most automatic recognizers.
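The alignment process just described can be sketched as a textbook dynamic time warping recursion. This is a minimal illustration, not the procedure of any particular recognizer: each frame is reduced to a single number (standing in for a spectrum or cepstrum), the local distance is the absolute difference, and only the three simplest step types (hold the template, hold the utterance, advance both) are allowed.

```python
def dtw_distance(utterance, template):
    """Accumulated distance along the best time alignment of two sequences."""
    inf = float("inf")
    n, m = len(utterance), len(template)
    # cost[i][j]: best accumulated distance aligning the first i utterance
    # frames with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(utterance[i - 1] - template[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame held in place
                cost[i][j - 1],      # utterance frame held in place
                cost[i - 1][j - 1],  # both advance by one step
            )
    return cost[n][m]

template = [1, 3, 4, 3, 1]
slow_version = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]  # same "word", spoken half as fast
other_word = [4, 1, 1, 4, 1]

d_same = dtw_distance(slow_version, template)
d_other = dtw_distance(other_word, template)
```

The slowed-down utterance aligns with the template at no cost, whereas a plain frame-by-frame comparison would not even be defined for sequences of different length.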

Dynamic programming is also used as a search strategy for longer speech
segments including entire words [3.7]. In another approach, which avoids the 
problem of time alignment, the spectrograms of entire words are recognized 
without cutting them up into shorter segments [3.8]. 



3.8 Markov Models 

The theory of Markov chains has a long history in mathematics and statistics. 
Markov chains were named after Andrei Andreevich Markov (1856-1922), 
who, along with Lyapunov, was a student of Pafnutii Chebyshev.^ Markov 
chains model statistical time series in which the probability of occurrence 
of a given event depends only on the near past. Applications of Markov 
chains abound in physics, economics, biology and, more recently, in speech 
processing. 

Many real-world events can be modeled by a first-order Markov process: 
the probability of an event is given solely by the immediately preceding state 
of the model. In other words, first-order Markov processes have a very short 
direct memory. Think of a radioactive atom (as used in radiation therapy for 
cancer). Its probability of decaying during the next second is independent of 
the prior history of the (ground-state) atom - as long as it still exists. Of 
course, once it has decayed, the probability of decaying again in the future 
is zero. 
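The radioactive atom can be written as a two-state first-order Markov chain, "intact" and "decayed", the latter being absorbing. The sketch below assumes an arbitrary decay probability of 1/10 per time step and uses exact fractions to check the lack of memory: having already survived 5 steps does not change the chance of surviving 3 more.

```python
from fractions import Fraction

def survival_probability(p_decay, steps):
    """Probability that the atom is still intact after the given number of steps."""
    intact, decayed = Fraction(1), Fraction(0)
    for _ in range(steps):
        # One transition of the chain; a decayed atom stays decayed.
        intact, decayed = intact * (1 - p_decay), decayed + intact * p_decay
    return intact

p = Fraction(1, 10)  # assumed decay probability per time step
survive_3 = survival_probability(p, 3)
survive_5 = survival_probability(p, 5)
survive_8 = survival_probability(p, 8)

# Conditional probability of surviving 3 more steps, given survival of 5.
conditional = survive_8 / survive_5
```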

^ Among electrical engineers, Chebyshev (1821-1894) is known mostly for the
equal-ripple filter, also called “Chebyshev filter” because it is based on the
Chebyshev polynomials. These polynomials resulted from Chebyshev’s work on
transforming circular into straight-line motion in the newly invented steam
engine. He is also noted for his work on the distribution of prime numbers and
the generalized law of large numbers.

A somewhat more sophisticated Markov model concerns $N_0$ couples on a
dance floor. When the music starts, each dancer is assigned a random dancing
mate. If it is his own partner, the couple, after one dance, is ordered to leave
the dance floor and go home. As a result there are now only $N_1 \le N_0$ (newly
assigned) couples still dancing. This number is the new state of the dance
floor model. The music starts again and the process of selection and expulsion
is repeated. After a while the dance floor is empty. The question is, what is
the expected number of dances before all the couples are safely home?

To calculate the required transition probabilities, one needs the
probability distribution of the number $n$ of fixed points for a random
permutation of $N$ different objects:

$$p_N(n) = \frac{1}{n!} \sum_{k=0}^{N-n} \frac{(-1)^k}{k!} .$$

For large $N$, the above formula gives the well-known $p_N(0) \approx 1/e$, answering
the question “if $N$ letters are stuffed randomly into $N$ envelopes, what is the
probability that none is properly addressed?”^
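The fixed-point distribution and the dance floor question lend themselves to a short exact computation. The sketch below assumes the standard form of the distribution, (1/n!) times the alternating sum of 1/k! for k from 0 to N - n, and the recursion E(N) = 1 + sum over n of p_N(n) E(N - n) with E(0) = 0 for the expected number of dances; the function names are of course made up.

```python
from fractions import Fraction
from math import exp, factorial

def p_fixed(N, n):
    """Probability of exactly n fixed points in a random permutation of N objects."""
    tail = sum(Fraction((-1) ** k, factorial(k)) for k in range(N - n + 1))
    return tail / factorial(n)

def expected_dances(N0):
    """Expected number of dances until all N0 couples have gone home."""
    expected = [Fraction(0)]  # an empty dance floor needs no further dance
    for N in range(1, N0 + 1):
        # The n = 0 term contains E(N) itself, so solve the recursion for E(N).
        rest = sum(p_fixed(N, n) * expected[N - n] for n in range(1, N + 1))
        expected.append((1 + rest) / (1 - p_fixed(N, 0)))
    return expected[N0]

none_matched = float(p_fixed(10, 0))  # already very close to 1/e
dances_for_six = expected_dances(6)
```

For the small cases tried here the recursion returns exactly the initial number of couples, e.g. 6 dances on average for 6 couples, in line with the fact that a random permutation has one fixed point on average.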



3.9 Shannon’s Outguessing Machine 
— A Markov Model Analyzer 

Not all games of chance are as fair and rousing as dancing with a random
mate. But some games make no pretensions of fair play; in fact, unfairness
is their very reason for being - such as Claude Shannon’s “outguessing
machine”, a beautiful application of a Markov model in which the guessing
behavior of a human being is modeled as a Markov process [3.10]. On the
basis of observing and analyzing the outputs of a human player in a heads-tails
guessing game, the outguessing machine constructs a model of an assumed
Markov process underlying the human responses. The machine then exploits
this model to predict future human behavior.

^ This formula was once guessed by the author - on the basis of the trivial facts
that $p_N(N) = 1/N!$ and $p_N(N-1) = 0$ and the further fact that $0! = 1!$ so that
an alternating series would give $p_N(N-1) = 0$.

The above distribution, as serendipity taught me, can be generalized by
multiplying it by $m^n$ and, inside the sum, by $m^k$. This yields, for
$N \to \infty$, the Poisson distribution with mean $m$. Curiously, for finite $N$
(and $m \le 1$) the generalized formula is again a bona fide probability
distribution with many interesting applications. It shares numerous properties
with the Poisson distribution, except that the maximum number of events is
bounded. It has the same factorial moments and cumulants as the Poisson
distribution (which is not the case for a Poisson distribution that is simply
truncated).

This generalized Poisson distribution was already studied in the 1930s by the
German mathematician Emil Julius Gumbel (1891-1966). Gumbel was a world
authority on the “statistics of extremes” (floods, climate, fatigue failure) [3.9] and
an early uncompromising pacifist whom Albert Einstein honored with the title
Apostle of Justice.

More specifically, the entrapping contraption, built by Shannon’s close 
collaborator David Hagelbarger, initially makes random heads-tails choices 
against a human contender. But once the machine has experienced its first 
wins, it begins analyzing the opponent’s “strategy” to a depth of two throws. 
Does he or she change after losing a throw? Does the player keep on choosing 
tails if tails has brought two previous wins? Or does the gambler get chary 
and heads for heads next? For most people such strategies are mostly sub- 
conscious, but the machine assumes the human to act like a second-order 
Markov process and uncovers the underlying transition probabilities without 
fail. 
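The flavor of such a machine can be conveyed by a much simplified predictor. The sketch below is not a description of Hagelbarger's actual circuit: it merely assumes that the player's next choice depends on his or her previous two choices, counts what followed each such pair in the past, and predicts the more frequent continuation. Win/loss information, which the real machine also used, is ignored here.

```python
from collections import defaultdict

class Outguesser:
    """Predicts heads ('H') or tails ('T') from the player's last two choices."""

    def __init__(self):
        # For every context (up to two previous choices): how often did
        # the player continue with heads or with tails?
        self.counts = defaultdict(lambda: {"H": 0, "T": 0})
        self.history = []

    def predict(self):
        seen = self.counts[tuple(self.history[-2:])]
        return "H" if seen["H"] >= seen["T"] else "T"

    def observe(self, choice):
        self.counts[tuple(self.history[-2:])][choice] += 1
        self.history.append(choice)

def play(player_moves):
    """Return the number of throws in which the machine guessed the player's move."""
    machine = Outguesser()
    wins = 0
    for move in player_moves:
        if machine.predict() == move:
            wins += 1
        machine.observe(move)
    return wins

# A player stuck in a fixed habit (heads, heads, tails, ...) is soon found out.
habitual_player = "HHT" * 40
machine_wins = play(habitual_player)
```

Against a truly random player such a predictor can do no better than break even on average, which is the point made in the text about flipping a true coin.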

Exploiting these, the machine always wins over the long haul, except 
against its creator. Shannon, keeping track of his machine’s inner states, can 
beat it 6 times out of 10. Of course, anyone could win 5 out of 10 throws 
on average by playing random (perhaps by flipping a true coin). But this 
is precisely what people, deprived of proper props, are incapable of doing, 
as Shannon’s machine has demonstrated again and again by beating a wide 
variety of human would-be winners. Specifically, the human mind appears to 
abhor long strings of like outcomes - as occur perfectly naturally in truly 
random sequences. 

Of course, the machine can have bad luck, too, especially in its initial 
guessing phase. I once wanted to show off the machine’s prowess to a foreign 
friend (the mathematician Fritz Hirzebruch) visiting Bell Laboratories. As 
luck would have it, Hirzebruch won 13 times in a row before his first loss. But 
thereafter the machine took off with a vengeance, overtaking the renowned 
mathematician on throw 31 (i.e. the machine won 16 out of the next 18 
throws!) and never fell behind again - in spite of the fact that Hirzebruch 
had been told (in general terms) how the machine worked. 



3.10 Hidden Markov Models in Speech Recognition 

First-order Markov processes are characterized by a set of states, a matrix 
of transition probabilities between these states, and the observable outputs 
that the process generates for each transition. The process is set in motion 
by a probability distribution for the initial states. 

Hidden Markov Models have been successfully applied in many difficult 
analysis tasks ranging from the deciphering of cryptograms to speech recog- 
nition. Let us therefore look at a hidden Markov model (HMM) in somewhat 
greater detail. Assume there are $N$ urns filled with a large number of colored
balls, say red, yellow or blue. The relative frequencies of the colors for each
urn are known.

At the beginning an urn is chosen according to a given probability
distribution for the initial states. A ball is picked at random from that urn, its
color, $C_1$, is announced and the ball is replaced into its urn.

At each clock time thereafter one of the urns is chosen according to the
given distribution of transition probabilities. Again a ball is picked from
that urn, its color, $C_2$, is announced and it is replaced. One of the problems
is to reconstruct the parameters of the model from the sequence of colors
$C_1, C_2, C_3, \ldots$. More specifically,

- Given the observations, what sequence of states (urns) is the most likely? 

- How do we adjust the model parameters to best conform to a large number 
of observations? 
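The urn-and-ball experiment is easy to simulate. In the following sketch the two urns, their color frequencies and the transition probabilities are arbitrary assumptions; the function returns both the hidden sequence of urns and the announced colors, although an outside observer would only ever get to see the latter.

```python
import random

def sample_urn_model(initial, transitions, color_freqs, colors, length, rng):
    """Draw a sequence of (hidden) urns and (announced) ball colors."""
    urns = range(len(initial))
    states, observed = [], []
    state = rng.choices(urns, weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # Pick a ball from the current urn; it is put back, so the
        # color frequencies never change.
        observed.append(rng.choices(colors, weights=color_freqs[state])[0])
        # Choose the urn for the next clock time.
        state = rng.choices(urns, weights=transitions[state])[0]
    return states, observed

colors = ["red", "yellow", "blue"]
initial = [1.0, 0.0]                    # always start with urn 0 (assumed)
transitions = [[0.7, 0.3], [0.4, 0.6]]  # hypothetical transition probabilities
color_freqs = [[0.8, 0.1, 0.1],         # urn 0 holds mostly red balls
               [0.1, 0.2, 0.7]]         # urn 1 holds mostly blue balls

states, observed = sample_urn_model(
    initial, transitions, color_freqs, colors, 10, random.Random(0))
```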

In the first problem we may recognize the speech recognition problem. The 
observations are the result of acoustic measurements of a speech signal, say 
(quantized) spectra or predictor coefficients. The different states represent 
parts (e.g., syllables, phonemes or subphonemic segments) of an utterance
to be recognized. Each utterance template in the lexicon of the recognizer 
requires a different model. The second problem is the “training problem” 
in which the parameters of the models are optimized to best describe the 
observation sequences. 

3.10.1 The model and algorithms 

Assume that the observations and states are given on a discrete time raster.
The observed acoustic features are vector-quantized into a finite set of vectors.
Let the possible states be numbered $i = 1, \ldots, N$ and the possible acoustic
feature vectors numbered $k = 1, \ldots, M$. Let $a_{ij}$ be the transition probability
from state $i$ to state $j$ during the time unit and $\pi_i$ the probability distribution
of the initial state for a given utterance model (omitting an index denoting
the model). Often a “left-right model” is assumed, where the states form
a chain that can only be traversed in one direction, but each state may be
repeated and single states may be skipped; thus only $a_{ij}$ with $j = i, i+1, i+2$
are nonzero and $\pi_i = \delta_{i1}$. In state $i$, an acoustic feature vector number $k$ is
generated with probability $b_{ik}$. The matrices $A = (a_{ij})$ and $B = (b_{ik})$ and
the vector $\pi = (\pi_i)$ characterize the model: $\lambda = (A, B, \pi)$.
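The structure of a left-right model can be illustrated by building its transition matrix explicitly. The probabilities for staying, stepping and skipping used below are arbitrary placeholder values (in practice they are estimated in training), and states are numbered from 0 rather than 1, as is customary in code.

```python
def left_right_model(n_states, stay=0.5, step=0.3, skip=0.2):
    """Transition matrix and initial distribution of a left-right model.

    From state i only the states i (repeat), i + 1 (advance) and
    i + 2 (skip one state) can be reached.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        allowed = {i: stay, i + 1: step, i + 2: skip}
        # Near the end of the chain some targets do not exist; drop them
        # and renormalize so that every row still sums to one.
        weights = {j: w for j, w in allowed.items() if j < n_states}
        total = sum(weights.values())
        for j, w in weights.items():
            A[i][j] = w / total
    initial = [1.0] + [0.0] * (n_states - 1)  # the chain always starts in its first state
    return A, initial

A, initial = left_right_model(5)
```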

When an observation sequence $O = (k_1, \ldots, k_T)$ is given, recognition
means finding the model $\lambda$ which maximizes the likelihood $P(O|\lambda)$. For a
given state sequence $q = (i_1, \ldots, i_T)$, we have (assuming statistical
independence of the observations $k_t$):

$$P(O|q,\lambda) = \prod_{t=1}^{T} P(k_t|i_t,\lambda) = b_{i_1 k_1} \cdots b_{i_T k_T} .$$

As the probability of state sequence $q$ is

$$P(q|\lambda) = \pi_{i_1} a_{i_1 i_2} \cdots a_{i_{T-1} i_T} ,$$

the total probability becomes

$$P(O|\lambda) = \sum_{\text{all } q} P(O|q,\lambda)\,P(q|\lambda)
= \sum_{i_1, \ldots, i_T} \pi_{i_1} b_{i_1 k_1} a_{i_1 i_2} b_{i_2 k_2} \cdots a_{i_{T-1} i_T} b_{i_T k_T} . \qquad (3.1)$$

Executing this multiple sum in a straightforward way would be infeasible,
requiring on the order of $2TN^T$ operations. There are fortunately some
sequential procedures to achieve this with reasonable expense (on the order of
$N^2 T$ operations), such as the following Forward Procedure:

$$\alpha_1(i) = \pi_i b_{i k_1} , \quad i = 1, \ldots, N ;$$

$$\alpha_{t+1}(j) = \Bigl[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Bigr] b_{j k_{t+1}} , \quad t = 1, \ldots, T-1, \quad j = 1, \ldots, N ;$$

$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i) .$$

The meaning of $\alpha_t(i)$ is the probability for the partial observation sequence
$(k_1, \ldots, k_t)$ and state $i_t = i$ at time $t$, given model $\lambda$. For recognition, $P(O|\lambda)$
must be evaluated for each model $\lambda$ in the lexicon.
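That the Forward Procedure indeed yields the same likelihood as the sum over all state sequences can be checked numerically on a toy model. The two-state model and the observation sequence below are invented for the purpose; states and feature vectors are numbered from 0.

```python
from itertools import product

def brute_force_likelihood(pi, A, B, obs):
    """Sum over all N**T state sequences, as in the multiple sum (3.1)."""
    total = 0.0
    for q in product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

def forward(pi, A, B, obs):
    """Forward Procedure: list of alpha_t(i) for t = 1, ..., T."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for k in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][k]
                      for j in range(n)])
    return alpha

# Hypothetical model with N = 2 states and M = 3 feature vectors.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 0]

likelihood = sum(forward(pi, A, B, obs)[-1])      # P(O | model)
reference = brute_force_likelihood(pi, A, B, obs)
```

For T = 5 and N = 2 the brute-force sum runs over only 32 state sequences; for realistic sequence lengths it would be hopeless, whereas the forward recursion grows only linearly with T.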

The training of an HMM requires much more effort than the recognition.
For each utterance model, one has to choose an appropriate set of states
and estimate the probabilities $a_{ij}$, $b_{ik}$ and possibly $\pi_i$ from a large
training set $\{O_n\}$ of acoustic realizations of this utterance so that the
likelihood $\prod_n P(O_n|\lambda)$ [with $P(O|\lambda)$ from (3.1)] is maximized. The
initial choice of the states can be random or constructed in a way similar to
vector quantization and the probabilities can be initialized as constants. This
yields an initial model $\lambda$. Based on this model and the training set, new
values of the probabilities (and thus, a new model $\bar\lambda$) are obtained which
better fit the observations. This is iterated sufficiently often (Baum-Welch
algorithm). Alternatively, a gradient descent iteration may be applied.

For the Baum-Welch iteration, we need the above Forward Procedure and
additionally a Backward Procedure:

$$\beta_T(i) = 1 , \quad i = 1, \ldots, N ;$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_{j k_{t+1}}\, \beta_{t+1}(j) , \quad t = T-1, \ldots, 1, \quad i = 1, \ldots, N .$$

The meaning of $\beta_t(i)$ is the probability for the partial observation sequence
$(k_{t+1}, \ldots, k_T)$, given model $\lambda$ and state $i_t = i$ at time $t$. Let $\gamma_t(i)$ be the
probability for state $i$ at time $t$, given the model and the observation sequence:

$$\gamma_t(i) = P(i_t = i \,|\, O, \lambda)
= P(O, i_t = i \,|\, \lambda) / P(O|\lambda)
= \alpha_t(i)\,\beta_t(i) / P(O|\lambda)
= \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i'=1}^{N} \alpha_t(i')\,\beta_t(i')} .$$
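The Backward Procedure and the state probabilities can be sketched in the same style as the forward recursion. The toy model is again an assumption made for illustration, and indices start at 0. A useful check, used below, is that the sum over states of the product of forward and backward variables gives the same value, the likelihood of the whole observation sequence, at every time step.

```python
def forward(pi, A, B, obs):
    """Forward variables alpha_t(i) for all time steps."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for k in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][k]
                      for j in range(n)])
    return alpha

def backward(A, B, obs):
    """Backward variables beta_t(i), computed from the last time step down."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta

def state_posteriors(pi, A, B, obs):
    """Probability of each state at each time, given model and observations."""
    gamma = []
    for a_t, b_t in zip(forward(pi, A, B, obs), backward(A, B, obs)):
        norm = sum(a * b for a, b in zip(a_t, b_t))  # likelihood of the sequence
        gamma.append([a * b / norm for a, b in zip(a_t, b_t)])
    return gamma

# Hypothetical two-state model with three possible feature vectors.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 0]

gamma = state_posteriors(pi, A, B, obs)
```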

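Similarly, let $\xi_t(i, j)$ be the joint probability for states $i$ at time $t$ and $j$ at
$t+1$, given the model and the observation sequence:

$$\xi_t(i,j) = P(i_t = i, i_{t+1} = j \,|\, O, \lambda)
= P(O, i_t = i, i_{t+1} = j \,|\, \lambda) / P(O|\lambda)
= \alpha_t(i)\, a_{ij}\, b_{j k_{t+1}}\, \beta_{t+1}(j) / P(O|\lambda) .$$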

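Obviously, $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$. The meaning of $\sum_{t=1}^{T-1} \gamma_t(i)$ is the expected
number of transitions from state $i$ and that of $\sum_{t=1}^{T-1} \xi_t(i, j)$ is the expected
number of transitions from $i$ to $j$. Thus, new model parameters can be estimated
as follows:

$$\bar{\pi}_i = \gamma_1(i),$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$

$$\bar{b}_{ik} = \frac{\sum_{t=1,\; k_t = k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}.$$

Model $\bar{\lambda}$ is always equally or more likely than $\lambda$. This reestimation is
iterated as often as required.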
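
The forward and backward recursions and the reestimation formulas can be condensed into a short program. The following is only a minimal sketch for a single discrete observation sequence, without the rescaling that long utterances require. The function names, the two-state toy model and the observation sequence are illustrative assumptions, and the forward recursion is written in its usual textbook form.

```python
def forward(pi, a, b, obs):
    """alpha[t][i]: probability of k_1..k_t and state i at time t."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for k in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[j][k]
                      for j in range(n)])
    return alpha


def backward(a, b, obs):
    """beta[t][i]: probability of k_{t+1}..k_T, given state i at time t."""
    n = len(a)
    beta = [[1.0] * n]                      # beta_T(i) = 1
    for k in reversed(obs[1:]):             # k_T, k_{T-1}, ..., k_2
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[j][k] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, a, b, obs):
    """One reestimation; returns the new parameters and the old P(O|lambda)."""
    n, n_sym, T = len(pi), len(b[0]), len(obs)
    alpha, beta = forward(pi, a, b, obs), backward(a, b, obs)
    p_obs = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_a = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_b = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(n_sym)] for i in range(n)]
    return new_pi, new_a, new_b, p_obs


# Hypothetical toy model: two states, two observation symbols.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 0, 1, 1, 1, 0]

likelihoods = []
for _ in range(5):
    pi, a, b, p_obs = baum_welch_step(pi, a, b, obs)
    likelihoods.append(p_obs)
```

The list `likelihoods` records $P(O|\lambda)$ before each reestimation, so it can be inspected to see that the likelihood does not decrease from one iteration to the next.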

The formulas have been written for a single utterance $O$ only. For actual
training, many realizations $O_n$ of this utterance with probabilities $P(O_n|\lambda)$
must be used. Then $\alpha$, $\beta$, $\gamma$, $\xi$ and the observations $k_t$ depend on $n$ (in
left-right models, $\pi_i = \gamma_1(i)$ is independent of $n$). Then in the reestimation
formulas, the numerators and denominators have to be preceded by
$\sum_n P(O_n|\lambda)^{-1}$.

The approach outlined here is only one possibility for a simple case. There 
are many variants of HMMs and training methods, for instance, dealing with 
continuously distributed observations rather than discrete values. These
distributions are then usually assumed to be mixtures of Gaussian densities. Also,
numerical tricks such as rescaling some quantities are often required. Further,
the HMM method may be merged with autoregressive (LPC) description of 
the observation data or with neural network techniques. 

3.11 Neural Networks

Serial or von Neumann computers^, the type of computer that dominates 
present markets - from laptop to mainframes - have become faster and faster 
over the years. Yet, compared to biological computers, they are still frustrat- 
ingly slow. The human eye, for example, almost immediately recognizes an- 
other human face - no small computational feat. The reason of course is that 
biological computers, such as our brains, process data not in a slow serial 
mode but in highly parallel fashion. Taking a cue from nature, engineers too 
have developed parallel computers but none with the breathtaking parallelism 
that we find in living systems. It is possible that this picture will change once 
“nonstandard” computers [3.12] based on genetic ingredients (DNA, RNA,
etc.) or quantum mechanics are available.^ In the meantime parallel computation
has advanced along a different avenue: neural networks. Neural networks
have their origin in the work of Warren McCulloch and his student Walter
Pitts in the early 1940s [3.13]. They showed that with a network of simple
two-state logical decision elements (“artificial neurons”) they could perform
Boolean algebra. And, according to A. M. Turing, any computable function 
can be computed by Boolean algebra. 

In 1949 D. O. Hebb introduced “Hebb’s rules” for changing the strengths 
of the connections between individual artificial neurons and thereby enabling 
such networks to adapt and learn from past experience. Early applications of 
neural nets, although not usually identified as such, are adaptive equalizers for 
telephone lines (important for maximizing the bit rate of your computer’s mo- 
dem) and adaptive echo cancelers (crucial in satellite communication). Tino 
Gramss simulated a neural network that learned to walk upright - after innu- 
merable tumbles during the learning phase. And Terrence Sejnowski designed 
a neural net for learning how to speak (given a written text). Apart from 
such instructive demonstrations as walking, reading or balancing a broom on 
a fingertip, neural networks have found many useful applications in pattern 
recognition (handprinted characters, faces, speech), the assessment of risks 
and credit ratings, the control of automated manufacturing plants, predict- 
ing stock and bond prices, optimal resource allocation, the traveling salesman 
problem, and how to back up a monster trailer truck. 

In all, neural networks have become an important ingredient of Artifi- 
cial Intelligence (AI); see the Glossary. Early pioneers of AI include Herbert 
Simon, Allan Newell, and Marvin Minsky. 

^The designation of serial computers as von Neumann computers could be mis- 
leading because von Neumann was one of the early thinkers about artificial neurons 
and neural networks for computing [3.11]. 

^One of the quirks of quantum systems is that, like a super juggler, they can 
keep innumerable “balls” in the air at the same time - until they are observed, when 
the wave function is said to collapse to a single state. Anticipated applications are 
factoring of 1000-digit numbers and pattern recognition, including automatic speech 
recognition. 

By 1987 the time was ripe for the First International Conference on Neural
Networks in San Diego, a mammoth meeting attended by such luminaries as
Leon Cooper of superconductivity Cooper-pair fame.

3.11.1 The Perceptron 

But already in 1957, F. Rosenblatt designed his famous perceptron for
classifying data [3.14]. The perceptron adds weighted inputs from two or more
sensors and feeds the sum into a hard limiter with a given threshold. If the 
sum exceeds the threshold, the perceptron’s output is 1; otherwise it is 0. The 
perceptron is trained by adaptively changing its weights. During the training 
phase, each weight is updated by adding or subtracting a small increment 
proportional to the difference between the desired and the actual output. 
While the perceptron enjoyed its year in the limelight, it soon became ap- 
parent that perceptrons could only classify data if the different classes fell on 
different sides of a straight line (or hyperplane) in the data space. In fact, 
in 1969, Minsky and Papert gave a formal proof of the perceptron’s main 
deficiency, which, incidentally, implied that one of the most basic logical op- 
erations, the “exclusive or” or XOR function, could not be implemented by 
a perceptron. This was, however, already known to Rosenblatt and he had 
proposed multi-layer perceptrons to overcome this problem. But the impact 
of the Minsky/Papert paper was so powerful that much of neural network 
work came to a virtual standstill until its rebirth - with a vengeance - in the 
1980s. 
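
The training rule and its limitation can be tried out directly. The following is a minimal sketch of a two-input perceptron with a hard limiter; the function names, the learning rate of 0.5 and the cap of 50 training passes are assumptions made for illustration only.

```python
def classify(weights, threshold, x):
    """Hard limiter: output 1 if the weighted sum exceeds the threshold."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > threshold else 0


def train_perceptron(samples, epochs=50, rate=0.5):
    """samples: list of (inputs, desired_output) with outputs 0 or 1."""
    weights, threshold = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, desired in samples:
            # Increment proportional to (desired - actual) output.
            delta = rate * (desired - classify(weights, threshold, x))
            if delta:
                errors += 1
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                threshold -= delta
        if errors == 0:          # every sample classified correctly
            break
    return weights, threshold


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, th_and = train_perceptron(AND)
w_xor, th_xor = train_perceptron(XOR)

and_learned = all(classify(w_and, th_and, x) == d for x, d in AND)
xor_learned = all(classify(w_xor, th_xor, x) == d for x, d in XOR)
```

The AND function is linearly separable, so `and_learned` comes out true after a few passes. For XOR no straight line separates the two classes, so `xor_learned` remains false no matter how long the training runs.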

3.11.2 Multilayer Networks 

Multilayer perceptrons lead to general multilayer neural networks, in which 
the outputs of the first layer are weighted and summed to form the inputs 
to the next or “hidden” layer. In this manner an arbitrary number of hidden 
layers could be interposed between inputs and output, but it was shown 
that, given enough nodes, just one hidden layer is sufficient to distinguish 
between arbitrarily complex decision regions in the input space.^ However,
multiple hidden layers are sometimes preferred because they can lead to faster 
learning. 

^The formal proof is based on work by Andrei Kolmogorov in connection with
David Hilbert’s 13th problem concerning the factorizability of polynomials. (Of the
total of 23 crucial problems that Hilbert posed at the International Mathematics
Congress in Paris in 1900, several remain unsolved to this day.)

3.11.3 Backward Error Propagation

One of the main reasons for the resurgence of neural networks was the invention
by Paul Werbos in 1974 of the backward error propagation or “backprop”
algorithm to train multilayer networks. This backprop algorithm was rediscovered
and patented by D. Parker in 1982 and brought to wide attention by
the influential group led by Rumelhart on “parallel distributed processing”
at the University of California in San Diego [3.15]. In the backprop algorithm
the input information is propagated through the network from one layer to
the next in a feedforward manner. The outputs of the last (highest) layer
are then subtracted from the desired outputs and the differences are used to
adjust the weights of all the layers in order to minimize the differences (in
a least-squares sense). The weight corrections are computed layer by layer,
starting with the highest one and propagating the differences downwards
through the network, which is then operating in a reversed manner.
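
The two passes can be sketched for a network with one hidden layer. This is an illustration under assumptions, not a reconstruction of any particular published implementation: the sigmoid nonlinearity, the constant bias input, the learning rate and all function names are choices made here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_net(n_in, n_hid, n_out, seed=0):
    """Random weights; the extra weight per node multiplies a constant 1."""
    rng = random.Random(seed)
    w_hid = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]
    return w_hid, w_out


def forward(net, x):
    """Feedforward pass from the inputs to the last (highest) layer."""
    w_hid, w_out = net
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = hidden + [1.0]
    output = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return hidden, output


def error(net, x, target):
    """Half the sum of squared differences between output and target."""
    _, output = forward(net, x)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))


def gradients(net, x, target):
    """Backward pass: differences are propagated from the top layer down."""
    w_hid, w_out = net
    hidden, output = forward(net, x)
    xb, hb = list(x) + [1.0], hidden + [1.0]
    d_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    d_hid = [h * (1 - h) * sum(d * w_out[k][j] for k, d in enumerate(d_out))
             for j, h in enumerate(hidden)]
    g_hid = [[d * v for v in xb] for d in d_hid]
    g_out = [[d * v for v in hb] for d in d_out]
    return g_hid, g_out


def train_step(net, x, target, rate=0.5):
    """Adjust all weights a small step against the error gradient."""
    for layer, grads in zip(net, gradients(net, x, target)):
        for row, g_row in zip(layer, grads):
            for j, g in enumerate(g_row):
                row[j] -= rate * g


# Hypothetical usage: the XOR task that a single perceptron cannot learn.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
net = make_net(2, 3, 1)
for _ in range(5000):
    for x, target in xor:
        train_step(net, x, target)
```

Whether a particular run actually settles on XOR depends on the starting weights and the learning rate; the sketch is meant to show the order of the computations, not to guarantee convergence.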

Not infrequently the hidden layer or layers of a fully trained network 
represent some pivotal aspects of the task. Thus, in Sejnowski’s speaking 
network some of the internal nodes represent, at least vaguely, the vowel 
sounds of the language. 

Backprop networks can also be used for noise reduction of noisy signals.
In the learning phase, noise-free signals are used as inputs. Furthermore, by
progressively reducing the number of internal nodes, such networks can also
be used for data compression. But no matter how well multilayer networks
perform, it is always advantageous to preprocess the data, for example by
transforming speech spectra to the hearing-related Bark-scale or by replacing
spectra by cepstra. (Using only the low quefrency portion of a cepstrum
eliminates sensitivity to the possibly irrelevant fine structure (“pitch”) of the
spectrum.)

3.11.4 Kohonen Self-Organizing Maps 

Kohonen’s self-organizing feature maps are an example of unsupervised learn- 
ing - as opposed to the supervised learning in which, during the learning 
phase, the outputs of a neural net are compared with the desired (“correct”) 
outputs [3.16]. 

Learning in a two-dimensional Kohonen net proceeds as follows. Starting
with “random” weights $w_{ij}$, the squared distances $(x_i - w_{ij})^2$ between the
inputs $x_i$ and their weights are summed to yield the total squared distance
$d_j$ to the $j$th node. (Note that the two-dimensional Kohonen net is indexed
by a single discrete variable.) Then select the minimum $d_j$ and, for a given
neighborhood of the $j$th node, update all weights by the following rule

$$w_{ij}(t + 1) = w_{ij}(t) + \alpha\,(x_i - w_{ij}(t)),$$

where the factor $\alpha$ and the size of the neighborhood are slowly decreased
during the learning phase. Once the learning is completed for a neighborhood, 
select another neighborhood and repeat the procedure until all nodes in the 
net have been trained. The result of this procedure for speech spectra as 
inputs is a two-dimensional “phonemic” map that is traversed from one node 
to a nearby node as the utterance proceeds. The two dimensions of a trained 
Kohonen map for speech recognition often resemble the plane of the first two
formant frequencies that distinguish the different vowel sounds. 
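
The update rule can be sketched in a few lines. The version below follows the common variant in which every input selects its own winning node and a square neighborhood around it is updated; the grid size, the linear decrease of $\alpha$ and of the neighborhood radius, and the four toy input vectors are assumptions made for illustration.

```python
import random


def best_node(nodes, x):
    """Index j of the node with the smallest squared distance d_j to x."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in nodes]
    return dists.index(min(dists))


def update(nodes, cols, winner, x, alpha, radius):
    """Move the winner and its grid neighbors a fraction alpha towards x."""
    win_row, win_col = divmod(winner, cols)
    for j, w in enumerate(nodes):
        row, col = divmod(j, cols)   # single index j -> position on the grid
        if abs(row - win_row) <= radius and abs(col - win_col) <= radius:
            for i in range(len(w)):
                w[i] += alpha * (x[i] - w[i])


def train_som(data, rows, cols, steps=2000, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for t in range(steps):
        shrink = 1.0 - t / steps                      # decreases from 1 to 0
        alpha = alpha0 * shrink
        radius = int(max(rows, cols) / 2 * shrink)
        x = rng.choice(data)
        update(nodes, cols, best_node(nodes, x), x, alpha, radius)
    return nodes


# Hypothetical two-dimensional inputs standing in for speech spectra.
data = [[0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]]
som = train_som(data, rows=4, cols=4)
```

Note that the nodes are stored in a flat list and addressed by a single index, with the position on the two-dimensional grid recovered by `divmod`.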



3.11.5 Hopfield Nets and Associative Memory 



The work of the physicist John Hopfield at Bell Laboratories and Princeton 
University was one of the strongest stimuli for the revival of artificial neural 
networks in the 1980s [3.17]. In contrast to the networks discussed so far, 
Hopfield nets have fixed weights. Furthermore, they are symmetric:
$w_{ij} = w_{ji}$. These weights represent a number of “memorized” patterns
$s_i^{(m)}$, $m = 1, 2, \ldots, M$, as follows

$$w_{ij} = \sum_{m} s_i^{(m)} s_j^{(m)}, \quad i \neq j; \quad i, j = 1, 2, \ldots, N.$$

The diagonal elements $w_{ii}$ are set equal to 0. To associate an unknown input
pattern with the different memorized patterns, the following expression is
computed

$$x_i(1) = \mathrm{sign}\Big[\sum_{j} w_{ij}\, x_j(0)\Big]$$

and iterated until the $x_i(n)$ do not change much anymore. For symmetric
$w_{ij}$, this procedure is equivalent to a search in an energy landscape. Thermodynamics
teaches that this procedure always converges to a, possibly local,
minimum in the energy landscape.

Sometimes the convergence gets stuck in a “wrong” local minimum. To 
allow the convergence to proceed to a new, lower and therefore better, min- 
imum, the energy landscape is moved up and down in a random fashion as 
if the corresponding physical system was sitting in a heat bath and the heat 
was turned on. (This is akin to the - illegal - tilting and shaking of pin- 
ball machines.) Subsequently, the temperature is slowly lowered until a new, 
hopefully lower, minimum is reached. In analogy to metallurgy, this powerful 
procedure is called “simulated annealing.” 

If the input pattern $x_i(0)$ is near one of the memorized patterns then
$x_i(n)$ will converge on that pattern. In addition, even if the input is degraded,
convergence to the “correct” pattern will usually occur. Here the degradation 
can consist of additive or multiplicative noise or partial obliteration. Thus, 
the Hopfield net acts as an associative memory: you show the net part of a 
human face and it will reconstruct the rest. 
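
A small numerical experiment shows the associative recall at work. The sketch below stores two mutually orthogonal patterns of eight elements with values +1 and -1, and then presents one of them with a single element inverted. The function names, the choice of patterns and the convention that a sum of exactly zero maps to +1 are assumptions made for illustration.

```python
def store(patterns):
    """Weights as sums of products of pattern elements; zero diagonal."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]


def recall(w, x, max_iter=20):
    """Iterate x_i <- sign(sum_j w_ij x_j) until the pattern is stable."""
    x = list(x)
    for _ in range(max_iter):
        new = [1 if sum(wij * xj for wij, xj in zip(row, x)) >= 0 else -1
               for row in w]
        if new == x:
            break
        x = new
    return x


p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = store([p1, p2])

degraded = [-1] + p1[1:]          # first element of p1 inverted
restored = recall(w, degraded)    # converges back to p1
```

Here all elements are updated simultaneously in each iteration; updating them one at a time in random order is an equally common choice.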

Associative memory is one of the preeminent attributes of the human 
brain: play a certain dance tune and you will suddenly see in your mind your 
long-lost partner with whom you danced, etc. 

The number $M$ of different memorized patterns a Hopfield net can distinguish
is limited to about one seventh of the number $N$ of its nodes, provided
the patterns are reasonably orthogonal to each other. This must be compared
to $2^N$ patterns for a binary memory. Thus, in terms of pure storage capacity,
Hopfield nets are a poor medium indeed. Their strengths lie in their computational
abilities. In addition to associative memory tasks, Hopfield nets
have been used for waveform reconstruction and numerous other tasks, including
the (in)famous traveling salesman problem: what is the shortest route
connecting a large number of cities?

If the weights $w_{ij}$ are not symmetric, convergence will usually not take
place, but new applications become possible such as recognizing a sequence 
of patterns (such as the phonemes in running speech). 



3.12 Whole Word Recognition 

Most automatic speech recognizers attempt to identify simple phonemes or
syllables and then string these together into words. By contrast, whole-word
recognition is based on the analysis of an entire word or even a sequence of
words. In this approach the speech spectrogram of a word, i.e. its energy (on a
gray scale), is considered as a picture and analyzed by known image processing
methods [3.18]. Specifically, the two-dimensional Fourier transform of the
speech energy (or an equivalent transform) is calculated.

Only the low-frequency components of the transformed image are retained,
thereby making the analysis independent of limited fluctuations in
speaking rate and formant frequencies. Typically, a total of 49 Fourier coefficients
yields excellent recognition results for both high-quality and
telephone-quality speech [3.8].



3.13 Robust Speech Recognition 

Robust speech recognition addresses the multifarious adverse conditions en- 
countered in practical applications of ASR such as noise, reverberation, poor 
frequency response, and speaker variability. These questions have recently 
been addressed in a special issue of Speech Communication [3.19]. Among 
the important remedies is the emphasis on the modulation transfer function
[3.20]. Like human hearing, which is most sensitive to modulation frequencies
around 4 Hz, see Fig. 3.3, machines, too, stand much to gain from
enhancing those modulation frequencies that are predominant in speech, i.e.
2-8 Hz [3.21], [3.22]. Figures 3.4-3.8 show the effects of modulation-frequency
filtering on “clean,” moderately noisy, very noisy, moderately reverberant and
very reverberant speech.

Fig. 3.3. The modulation transfer function of human hearing (labeled “internal”)
compared to the modulation spectrum of speech at a normal speaking rate. Note 
that the drop-off of the auditory system beyond 8 Hz allows all speech modulations 
to pass unhindered. - Also shown is the modulation transfer function of a lecture 
hall (reverberation time T = 0.5 s) and a concert hall (T = 2 s). The latter especially 
makes speech very difficult to understand because long reverberation “blurs” the 
modulation 



3.14 The Modulation Transfer Function 

The modulation transfer function of a system characterizes the loss (or gain) 
of the degree of modulation of a modulated signal applied to its input. The 
fact that the human eye cannot follow rapid fluctuations in light intensity 
but perceives a steady luminance has allowed the motion picture and televi- 
sion industries to replace a continuously moving image by a discrete sequence 
of still images called frames. In television this inertia in visual perception is 
called “flicker fusion”: above a certain frame rate, say 60 per second, depending
on viewing angle and light intensity, continuous motion is perceived.

Similarly, our sense of hearing is characterized by a certain inertia. The
human ear perceives modulation frequencies around 4 Hz best. For modulation
frequencies above 20 Hz there is a rapid fall-off of our ability to hear
the modulation as modulation. [Certain birds, guineafowl (Numida meleagris)
among them, are sensitive to and in fact use much higher modulation
frequencies.]

It is interesting to note that human speech at normal speaking rates has
a peak in its modulation-frequency distribution at 4 Hz. Is this a case of
co-evolution of human speech and human hearing?

In a reverberant environment (think of a concert hall) the higher modulation
frequencies of speech and music are attenuated by the smoothing effect
of the reverberation. (This is in fact why music needs sufficient reverberation
to sound good.) This attenuation in the degree of modulation is described
by the modulation transfer function (MTF). In fact, by measuring the MTF
with music, the reverberation time of a hall can be determined during an
ongoing concert in the presence of an audience. (Habitually reverberation is
measured by means of bangs - pistol or cannon shots - and filtered noise in
the empty hall whose acoustics are of course quite different from the response
of the occupied hall.)

Fig. 3.4. Top: spectrogram of the utterance “two oh five” by a female speaker.
Bottom: spectrogram for the same signal with modulation frequencies between 2
and 8 Hz enhanced. (Figures 3.4-3.8 were kindly supplied by Steven Greenberg and
Brian E.D. Kingsbury)

Fig. 3.5. The same as Fig. 3.4 for moderately noisy speech

Measurement of the MTF, which is also reduced by ambient noise, is now
a preferred method of estimating speech intelligibility [3.23]. It is interesting
to note that, for linear systems, the inverse Fourier transform of the MTF
is the squared impulse response of the system [3.24] - just as the inverse
Fourier transform of the transfer function itself is the impulse response. Thus,
the MTF fits into a nice symmetrical pattern when Fourier transforms are
considered: What the transfer function is to the impulse response is the MTF
to the squared impulse response.
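
The relation between the MTF and the squared impulse response can be checked numerically. The sketch below rests on several assumptions that are not part of the text: the squared impulse response of a hall is idealized as a pure exponential decay that falls by 60 dB within the reverberation time $T$, the MTF is taken as the magnitude of the Fourier transform of the squared response normalized to 1 at zero modulation frequency, and the integral is approximated by a sum over 1-ms steps. Under these assumptions the MTF has the closed form $1/\sqrt{1 + (2\pi F T/\ln 10^6)^2}$ for modulation frequency $F$.

```python
import cmath
import math

DECAY_60_DB = math.log(1e6)      # energy falls by a factor 10^6 within T


def exponential_decay(T, dt, duration):
    """Idealized squared impulse response of a hall, sampled every dt s."""
    rate = DECAY_60_DB / T
    return [math.exp(-rate * n * dt) for n in range(int(duration / dt))]


def mtf_from_squared_response(h2, dt, freq):
    """Normalized magnitude of the Fourier transform of h^2 at freq (Hz)."""
    total = sum(v * cmath.exp(-2j * math.pi * freq * n * dt)
                for n, v in enumerate(h2))
    return abs(total) / sum(h2)


def mtf_closed_form(T, freq):
    """MTF of a purely exponential decay with reverberation time T."""
    return 1.0 / math.sqrt(1.0 + (2 * math.pi * freq * T / DECAY_60_DB) ** 2)


dt = 0.001
lecture_hall = exponential_decay(T=0.5, dt=dt, duration=3.0)
concert_hall = exponential_decay(T=2.0, dt=dt, duration=8.0)

mtf_lecture = mtf_from_squared_response(lecture_hall, dt, 4.0)
mtf_concert = mtf_from_squared_response(concert_hall, dt, 4.0)
```

With the reverberation times of Fig. 3.3, the 4-Hz modulation that dominates speech is passed much less faithfully by the idealized concert hall than by the lecture hall, in line with the observation that long reverberation blurs the modulation.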

Fig. 3.6. The same as Fig. 3.4 for very noisy speech

Fig. 3.7. The same as Fig. 3.4 for moderately reverberant speech

Fig. 3.8. The same as Fig. 3.4 for very reverberant speech. The improvements due
to modulation-frequency filtering are evident for all tested impairments







4. Speech Dialogue Systems 
and Natural Language Processing 



by Holger Quast, Drittes Physikalisches Institut, Universität Göttingen,
and Robert Bosch GmbH Research and Development 



Polonius: What do you read, my lord? 

Hamlet: Words, words, words. 

William Shakespeare (1564-1616), Hamlet



If language is the dress of thought as Johnson claims, speech is its genuine 
overcoat. Traditionally, speech and language have been studied separately, 
the former as an acoustic phenomenon, the latter as representation of mean- 
ing, symbolized by words. Slow but steady progress in the field of speech 
recognition together with increasing computational power made it possible 
to use our natural verbal communication channel as front end for speech di- 
alogue systems, and moreover have allowed the first applications of this kind 
to step out of the laboratories into the real world. During the second half 
of the nineties, actual products started to appear that were more than mere 
technological gadgets: dictation programs for word processors, voice dialing 
for mobile phones, and dialogue engines that carry out tasks such as air travel 
reservation, banking, and even translation over the phone. While these devices
are often still cumbersome to work with, they bear the potential of the
most natural human-machine interface. Spoken language dialogue programs
provide useful services where human operators would be too expensive, they
are an invaluable help for physically incapacitated people, and they allow
devices to be controlled where hands-free operation is safer and speech input
has a higher data throughput than a tactile interface. This chapter introduces
some instrumental concepts and techniques from the field of language processing,
and its application, natural language speech dialogue systems.



4.1 The Structure of Language 

Language, as a structured communication of verbal symbols, is a universal 
characteristic of our species. The Ethnologue currently lists 108 language 
families with more than 6800 spoken languages [4.1], evolved under the most
varied conditions, in diverse cultural and geographic settings. How can we 
possibly hope to find a scientific framework that does justice to all of them? 
Well, fortunately, despite their vastly different backgrounds, certain general 
qualities inherent to all languages do exist that allow common ways to analyze 
them. 

4.1.1 From Sound to Cognition:
Levels of Language Analysis and Knowledge Representation

Suppose you saw the following sentences: 

1. Hamlet is the prince of Denmark. 

2. Hamlet is the prince of Portugal. 

3. a) Hamlet is the prince of Denmark, please. 

3. b) Hamlet is the prince of Denmark, and Hamlet is no prince. 

4. The Portugal is Hamlet of prince. 

5. The prince of Denmark Hamlet is. 

6. gkgkgkgkg kgkgkg. 

In a conversation about Shakespeare, nothing is wrong with the first state- 
ment. Having read Hamlet however, our learning of Elizabethan literature - 
part of our world knowledge - tells us the declaration formed by the words 
of the second sentence is incorrect. Hamlet is the Prince of Denmark, not of 
Portugal. No knowledge of English classics is required to spot the pragmatic 
errors in examples 3a and b: we can’t really think of any ordinary context 
in which these sentences would be used. Sentence 4 is met with equally low 
acceptance. Assuming its constituents are arranged in the common English 
noun phrase - auxiliary verb - noun phrase order, in the way they modify 
each other, they do not add up to a proper sentence meaning, to proper semantics.
Unless you were a grandmaster Jedi knight or hexed by a wicked
fairy to converse in iambic tetrameter, you would perceive the phrase structure,
the syntax, of remark 5 as rather eccentric. Finally, chaos and confusion
rule line 6. Not only does the lack of components that make up proper English
words reveal the example to be morphologically ill-formed. It is impossible to
distinguish between ‘g’s and ‘k’s in the given sequence without an opening
of the glottis between letters - the words are also flawed from a phonological
point of view.

Phonetics and Phonology. The above examples illustrate the most promi- 
nent levels on which language is typically examined. Approaching language 
analysis from the speech side, the first sensation that comes to our ears is the 
sound of the spoken utterance. The study of how these sounds are generated, 
transmitted, and received is called phonetics. A number of notations were 
developed to transcribe the physical signal into higher-level written phonetic 
symbols. Common alphabets are the traditional standard International Phonetic
Alphabet (IPA), the TI/MIT set and its close relative, the ARPAbet,
which allow speech sounds to be transcribed into ASCII letters, the Worldbet,
developed at AT&T to be able to represent multiple languages in ASCII, and
SAMPA, an ASCII-based scheme designed to represent the major European
languages of the EUROM corpus. The smallest distinguishable vocalization
unit, an “atomic” speech sound so to speak, is called a phone.

Whereas the study of phonetics comprehends speech entirely on a sig- 
nal level, phonology is already a language-dependent discipline of linguistics. 
In this field, the phonetic patterns of a particular natural language are in- 
vestigated and grouped into phonemes, the smallest contrastive units - i.e. 
necessary to distinguish between words - of a language. While a phone corre- 
sponds directly to a physical signal, a phoneme is an abstract linguistic unit 
that can have various different phones, i.e. acoustic manifestations, associ- 
ated with it. These constitute the set of allophones for the given phoneme. 
Often, the articulatory and acoustic differences between allophones are not 
easily recognized by the speaker of a language. Consider the variants of the 
phoneme /t/ in the following examples: the aspirated t of “top,” the one in 
“stop” (not aspirated), and the tt as pronounced in “butter” produce allo- 
phones of the same phoneme. 

Morphology. One step further away from the speech sound, in the field 
of morphology, the study of internal word structure, we encounter a similar 
formalism: the smallest unit of meaning in a language is called a morpheme,
morphs describe the actual realization of a morpheme, and the class of morphs 
belonging to one morpheme is formed by its allomorphs. Consider for example 
the word “undoable” which is made up of three morphemes, the prefix “un,” 
the root “do,” and the suffix “able.” The morphs “ir” and “in” in irrelevant 
and indestructible are allomorphs, appearing in different morphological en- 
vironments. In an English language dialogue system, a verb’s suffix-morph 
such as “s,” “ed,” or “ing” can for instance be used to indicate the time of 
the action through the verb’s tense and therefore give clues for succeeding 
analyses. 

Syntax. The structure of sentences is governed by the rules of syntax. It 
describes the way word types can be combined to form phrases and clauses 
that make up a sentence, and therefore also determines what structural func- 
tion a word can assume in a particular sentence. In traditional Chomskyan 
linguistics, this set of rules forms a language’s grammar, a most powerful
architecture of thought inherent only in human communication^. (In general
linguistics, grammar is understood as a language’s entire morphosyntax.) 

^ While the final verdict is still pending, most researchers believe that other 
species which communicate vocally are only able to transmit nonverbal information 
such as emotions or emotive signals, or at best a string of semantic chunks referred 
to as protolanguage by Bickerton [4.2]. 

The concept of grammars and their representation as trees is most useful
for machine text analysis as well, see Fig. 4.1 and the following sections on 
grammars and parsing. 



       S
     /   \
   N       VP
   |      /  \
   |     V    N
   |     |    |
  Bob  eats fries.

S: sentence
VP: verb phrase
V: verb
N: noun



Fig. 4.1. A simple syntax-tree representation of the sentence Bob eats fries 



Semantics. As we enter the world of meaning, we first analyze the senses of 
the phrase’s words and the relations between those to understand the mean- 
ing of the entire sentence. This study is called semantics. Its most common 
representation in natural language processing is the logical form [4.3], see 
below. The largest unit of meaning we are dealing with in semantics is one 
sentence. If we were investigating a sequence like “Jimi plays guitar. The 
valves of this instrument are made of shining brass,” each sentence’s seman- 
tics could be properly represented, and we would not catch the contextual 
discrepancy that “this” instrument can’t refer to “guitar” (because our world 
knowledge tells us guitars don’t have valves). 

Pragmatics, Discourse and World Knowledge. The next analysis layer
closes the context gap between an utterance and its background. Here, we
understand language as it is used in an environment, or context, including a
sentence’s effect on the communicating agents. We employ discourse knowledge
and world knowledge such as the ability to understand the situation
of the conversation and a memory of sentences already uttered, the general
principles of conversation in a culture and language, and understanding of
the world both on a situation-specific and a general level. In general, the
entire analysis on this level, that is, meaning interpretation and integration
on the basis of semantic evidence, contextual and world knowledge, is understood
to belong to the field of pragmatics, the study of those principles that
form our understanding why certain sentences are anomalous, or not possible
utterances in a given context [4.4].

The sequence in which these layers are presented here is indeed a common
processing order for an utterance in a speech dialogue system: First, the
incoming - phonetic - speech signal is transferred to text by the speech
recognition unit. Its output in turn is analyzed by a parser to determine the
syntactic structure. After a semantic representation like the logical form (see
p. 76) has been established, discourse and dialogue management functions
put the utterance in the context of the current dialogue and the domain
it is working in. This information state is the basis for devising the further
strategy. If a response is to be formulated, the process is now reversed. The
system starts with the plan to generate an answer and ideally applies the
same discourse concepts and grammars it uses to analyze incoming speech
to formulate its own utterance, from the knowledge representation down to
a speech synthesis module in which text is transformed to speech.

4.1.2 Grammars 

To define a language on the syntax level, it is common to use a set of rules 
that is part of the language’s grammar. A formal grammar consists of a set 
of word categories (often called parts of speech), rules how to combine these 
categories to higher level phrases, the set of these phrases, and an initial 
symbol which represents the highest-level phrase. These rules can then be 
applied to generate strings of words in a language, and to parse a string, i.e. 
find a sentence’s syntactic structure. Figure 4.2 gives an example of a small 
grammar and a sentence’s tree representation. 



             S
          /     \
        NP        VP
        |       /    \
       NAME   V       NP
        |     |     /   \
        |     |   ART    NP
        |     |    |    /  \
        |     |    |  ADJ   N
        |     |    |   |    |
       Jimi plays the red guitar.

Start symbol:
S: sentence

Other nonterminal symbols:
NP: noun phrase
VP: verb phrase

Terminal symbols:
N: noun
NAME: proper name
V: verb
ART: article
ADJ: adjective

Lexicon:
guitar: noun
Jimi: proper name
plays: noun, verb
red: adjective
the: article

Rewrite rules:
S → NP VP
NP → NAME
NP → ART NP
NP → ADJ N
VP → V NP

Fig. 4.2. A simple formal grammar and a tree representation of a syntactically 
well-formed sentence in this language 



In the parse tree, the top element S stands for the whole sentence “Jimi 
plays the red guitar.” This start symbol initiates the parse. Looking at the 
grammar’s rewrite rules, we find that a sentence in this language can only 
consist of a noun phrase followed by a verb phrase: the left-hand side symbol 
S turns into the two right-hand side symbols NP and VP. Both noun phrases 
and verb phrases are nonterminal, meaning they are broken down into smaller 
elements, either further rewrite rules or terminal symbols. The first noun 
phrase consists of only one element, “Jimi,” and a look in the lexicon tells us 
this word is a proper name. Indeed there is one rule that matches this case:
NP → NAME. NAME, representing the word class of proper names, is a terminal
symbol, so we are done parsing this branch. At the other branches, we proceed
likewise, applying rewrite rules whose left-hand side symbols match the given
nonterminal entries in the parse tree, and substituting them with the symbols
on the right side of these rules until we are left with all terminal symbols.
Note that the lexicon - i.e. the table containing a language’s words and their
respective parts of speech - maps ‘plays’ to two entries: ‘plays’ can either be
a verb or a noun. But since this language does not map verb phrases to any
sequence that starts with a noun, the grammar saves us from an ambiguity
deadlock.
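
As an illustration, the parsing procedure just described can be sketched in a
few lines of Python. The grammar and lexicon are transcribed from Fig. 4.2;
the function names (parse, expand, parse_sentence), the lower-casing of words,
and the choice of a backtracking top-down strategy are assumptions made for
this sketch, not something the text prescribes.

```python
# Grammar and lexicon transcribed from Fig. 4.2 (words lower-cased).
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["NAME"], ["ART", "NP"], ["ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "guitar": {"N"},
    "jimi": {"NAME"},
    "plays": {"N", "V"},
    "red": {"ADJ"},
    "the": {"ART"},
}


def parse(symbol, words, pos=0):
    """Yield (tree, next_pos) for every way `symbol` can cover words[pos:next_pos]."""
    if symbol not in RULES:  # terminal symbol: consult the lexicon
        if pos < len(words) and symbol in LEXICON.get(words[pos], set()):
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in RULES[symbol]:  # nonterminal: try each rewrite rule in turn
        yield from expand(symbol, rhs, words, pos, [])


def expand(symbol, rhs, words, pos, children):
    """Match the right-hand-side symbols one after another, with backtracking."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])


def parse_sentence(sentence):
    """Return all parse trees that start at S and consume the whole sentence."""
    words = sentence.lower().rstrip(".").split()
    return [tree for tree, end in parse("S", words) if end == len(words)]
```

For “Jimi plays the red guitar.” the sketch finds exactly one tree: the noun
reading of ‘plays’ is discarded because no rule lets a verb phrase start with
a noun. The approach only terminates because the grammar has no left-recursive
rule.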

Undeniably, the generative capacity of the above grammar does not allow
us to discuss Shakespeare. Looking at the other side of the complexity scale,
a computer program written in natural language would be - unfortunately
so - an insurmountable challenge for today’s compilers. A ranking of formal
languages and their grammars by generative capacity was introduced by
Chomsky in 1956 and is now known as the Chomsky hierarchy [4.5], see
Fig. 4.3.



Type 0 grammar:  no restrictions
Type 1 grammar:  αAβ → αγβ
Type 2 grammar:  A → γ
Type 3 grammar:  A → a,  A → aB

Fig. 4.3. The Chomsky hierarchy of formal languages. The languages with more
restrictive grammars (high type numbers) are contained in the languages with more
powerful grammar classes (lower type numbers). A, B symbolize nonterminal
elements, a denotes a terminal, and α, β, γ are strings of terminals or nonterminals



Type 3 grammars - these generate regular languages. All rewrite rules
have exactly one nonterminal symbol on the left side and a single terminal
symbol on the right-hand side that can be followed by exactly one
nonterminal item. The initial (left-hand side) symbol can map to the empty
sequence. The grammar can be represented by the finite-state automaton
mechanism illustrated below, see Fig. 4.4.

Type 2 grammars - the syntax of context-free languages. Also called
phrase-structure grammars and commonly written in Backus-Naur form. These
are defined as rules having one single nonterminal on the left-hand side and
an (unrestricted) combination of terminals and nonterminals on the right.
Their straightforward structure with only one element on the left side of a
rewrite rule allows easy parsing and therefore makes context-free
grammars (CFGs) the structure of choice for programming languages.

Type 1 grammars - they form context-sensitive languages which are an
extension of context-free languages insofar as they allow rewrite rules
with more than one symbol on the left-hand side, including terminal
symbols - just as long as there is at least one nonterminal on the left
side and the number of left-side symbols doesn’t exceed the number of
right-side symbols. Rewrite rules can be expressed as αAβ → αγβ with
A a nonterminal and α, β, γ strings of terminal and nonterminal symbols.
α and β may be empty, γ must not. Those elements other than the one
we are replacing are the rule’s context. Type 1 and type 2 grammars can
also be expressed as finite state automata (see below) if we equip them
with memory to store symbols.

Type 0 grammars, finally, have no restrictions on their grammar rules.
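
The restrictions listed above can be turned into a small rule classifier. The
following Python sketch is only an illustration under stated assumptions: the
function names are hypothetical, rules are given as pairs of symbol tuples,
the type 1 test uses the length condition from the description above rather
than checking the full αAβ → αγβ shape, and the special case of the start
symbol mapping to the empty sequence is ignored.

```python
def rule_type(lhs, rhs, nonterminals):
    """Return the highest type number (3, 2, 1 or 0) whose restriction the rule meets.

    lhs and rhs are tuples of symbols; every symbol that is not listed in
    `nonterminals` counts as a terminal.
    """
    if len(lhs) == 1 and lhs[0] in nonterminals:
        starts_with_terminal = len(rhs) >= 1 and rhs[0] not in nonterminals
        if starts_with_terminal and (
            len(rhs) == 1 or (len(rhs) == 2 and rhs[1] in nonterminals)
        ):
            return 3  # A -> a  or  A -> aB
        return 2  # A -> gamma
    # At least one nonterminal on the left, and the rule does not shrink the string.
    if any(s in nonterminals for s in lhs) and len(lhs) <= len(rhs):
        return 1
    return 0


def grammar_type(rules, nonterminals):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs, nonterminals) for lhs, rhs in rules)


# The rewrite rules of Fig. 4.2, with word categories as terminal symbols.
NONTERMINALS = {"S", "NP", "VP"}
FIG_4_2 = [
    (("S",), ("NP", "VP")),
    (("NP",), ("NAME",)),
    (("NP",), ("ART", "NP")),
    (("NP",), ("ADJ", "N")),
    (("VP",), ("V", "NP")),
]
```

Under these assumptions the grammar of Fig. 4.2 comes out as type 2: most of
its rules would pass as type 3, but S → NP VP and NP → ADJ N do not.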

This conception is based on work by Chomsky in the 1950s/60s and serves 
nicely to illustrate how language can be systematically structured. Since 
then, a number of other formalisms have been developed, e.g. government
and binding in the 1980s, and later minimalism, methods based on feature
structures and unification, lexicalization, and many more. Even though a
natural language, i.e. a language spoken or written by humans, cannot be 
adequately characterized by means of a formal grammar to define generative 
capacity [4.3], the tools we obtain from grammar theory are invaluable for 
natural language processing as we will see in the following sections. 



4.1.3 Symbolic Processing 

Two main concepts dominate the field of natural language understanding: the 
symbolic and the statistical approach. The former was heavily influenced by 
Chomsky’s work on grammars. As in the classical, ‘neat’ branch of artificial 
intelligence, language capacity is explicitly coded as rules and decision logic, 
just like in an expert system. For this purpose, a large box of tools has been 
developed which implement and help to visualize the system’s logic. 

Finite State Machines. One of these tools that has found extensive ap- 
plication on many levels of symbolic language analysis is the finite state au- 
tomaton (FSA), also called finite state machine or transition network. It 
consists of states and arcs that are traversed to match a rule or requirement. 
In the example below, the network represents a noun phrase rewrite rule:

[Figure: states ini → NP1 → NP2 → end, connected by arcs labeled ART, ADJ, NOUN]

Fig. 4.4. A finite state automaton for the noun phrase NP → ART ADJ NOUN

Say we are checking the phrase “the green door.” The automaton is started
and set to the ini state. The first word to check, “the,” matches the article
category, so the first arc can be traversed and the network proceeds to state
NP1. “green” takes us to state NP2, “door” to the end state; the automaton
was passed successfully, so we have a noun phrase. The capacity of this little
scheme is equivalent to a type 3 (regular) grammar.

If we allow arcs that begin and end at the same state, we get a recursion
that enables us to encode repeating phrases or parts of speech in one rule,
e.g. to encode “the ugly old green wooden door,” where we have a looping
adjective arc before the net arrives at the noun state. This handy extension
lets a single network accept phrases of unbounded length. Note that the loop
by itself makes the automaton neither non-deterministic nor more powerful:
a non-deterministic FSA can always be converted into a deterministic one,
and both accept exactly the regular languages.
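
Both the automaton of Fig. 4.4 and its looping variant can be written down as
transition tables. The Python sketch below is a minimal illustration; the
mini-lexicon, the table names, and the choice to place the adjective loop on
state NP1 are assumptions made here.

```python
# Hypothetical mini-lexicon mapping words to parts of speech.
LEXICON = {
    "the": "ART", "ugly": "ADJ", "old": "ADJ", "green": "ADJ",
    "wooden": "ADJ", "door": "NOUN",
}

# Automaton of Fig. 4.4: (state, category) -> next state.
LINEAR_NP = {
    ("ini", "ART"): "NP1",
    ("NP1", "ADJ"): "NP2",
    ("NP2", "NOUN"): "end",
}

# Variant with an adjective arc that begins and ends at state NP1.
LOOPING_NP = {
    ("ini", "ART"): "NP1",
    ("NP1", "ADJ"): "NP1",
    ("NP1", "NOUN"): "end",
}


def accepts(transitions, phrase):
    """Traverse one arc per word; succeed only if we stop in the end state."""
    state = "ini"
    for word in phrase.lower().split():
        state = transitions.get((state, LEXICON.get(word)))
        if state is None:  # no arc matches this word: reject
            return False
    return state == "end"
```

With these tables, “the green door” passes the linear network, while “the
ugly old green wooden door” is only accepted by the looping one. As written,
the looping table also accepts a phrase without any adjective, such as “the
door.”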

Add to this the ability to refer to other networks and parts of speech and 
we obtain a structure capable of describing a context-free grammar. This 
setup is called a recursive transition network. Since we can now reference 
other networks, we are able to build cascaded rules - where one network 
containing a nonterminal links to another network, waits until that one is 
completed, then picks up its process - and thereby represent a whole CFG 
parse tree. 

One can also use an automaton to map from one set of symbols to
another one, for instance to translate strings. This construct, the finite state
transducer (FST, cf. [4.6]), is commonly applied in morphological analysis,
e.g. to describe derived forms of a word stem*. Consider the rule

mice → mouse N PL

meaning we attribute word form “mice” to its singular root form “mouse,”
part-of-speech tag N and number tag PL. How can this be accomplished with
an extended FSA? To induce generating capacity, we now don’t just define a
network by a set of symbols like a finite state machine, but by a set of pairs
of symbols. This way, the above formula can be written as

m:m, i:o, ε:u, c:s, e:e, ε:N, ε:PL ,

where m in ‘mice’ gets mapped to m in ‘mouse,’ i to o, the empty symbol ε
to u, and so forth. The algebra of FSTs allows us to invert a transducer, so the
same machine can be used for analysis and generation. And, of course, we
can also apply transducers to check whether two strings correspond to each
other, to generate string pairs, or to find relations between sets of strings.

* Algorithms like these are called stemmers. They are very useful in data mining
applications. Say you went to your favorite internet search engine and looked
for “swimming”; you would also want it to consider keywords such as “swim,”
“swam,” “swimmer,” or “swimmable.” The stemming tool turns all of these items
into their word stem and ideally allows the search engine to find all related forms.
Of course, one could abuse one’s memory and record all derived forms of a word,
but, for English, a clever general approach (see [4.7]) works rather well: apply a
(manageable) number of rules, e.g. morphotactics for plurals and participles, and
then simply replace or remove suffixes in three steps (oversimplifying Porter’s nifty
algorithm), such as in “ational” → “ate” to turn relational into relate, or “ization”
→ “ize” → ε. The accuracy of this simple method is surprisingly good; it plays in
the same league as elaborate algorithms with vast lexica. However, it does create
a number of droll errors and ambiguities as listed in [6]: organization → organ,
university → universe, doing → doe, etc.
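
The pair notation can be tried out directly. The following Python sketch
represents a single path through a transducer as a list of symbol pairs; it is
not a full network with states and alternative arcs, and the names EPS, MICE,
invert, and transduce are assumptions introduced for illustration.

```python
EPS = ""  # stands for the empty symbol

# One transducer path as (surface symbol, lexical symbol) pairs: mice -> mouse N PL.
MICE = [
    ("m", "m"), ("i", "o"), (EPS, "u"), ("c", "s"), ("e", "e"),
    (EPS, "N"), (EPS, "PL"),
]


def invert(path):
    """Swap input and output sides, turning analysis into generation."""
    return [(out_sym, in_sym) for in_sym, out_sym in path]


def transduce(path, symbols):
    """Return the output symbols if the path's input side spells `symbols`, else None."""
    inputs = [in_sym for in_sym, _ in path if in_sym != EPS]
    if inputs != list(symbols):
        return None
    return [out_sym for _, out_sym in path if out_sym != EPS]
```

Calling transduce(MICE, "mice") yields the symbols of “mouse” followed by the
tags N and PL, and applying the inverted path to that result gives back the
letters of “mice.”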

Feature Structure. Many languages, including English, have morphologies
that enable us to extract substantial information from words, for instance
about tense and mode, whether we are referring to the first, second, or third
person, the number (singular or plural), and so forth. This concept is known
as feature structure. In the example above, we recognize the word form “mice”
to be a noun in its plural form. Say we were parsing the syntax of the sentence
“I know this: mice don’t fly.” Even though we don’t get any punctuation from
our speech recognition unit, we can tell straight away that “this” is the
object of the clause “I know this” rather than the article of an ART N noun
phrase “this mice,” because the number features for this (ART singular) and
mice (N plural) do not match.

It is often possible to observe the desired features in the item that
determines the syntactic function of a phrase. This item, referred to as the
phrase’s head, passes its feature structure up to the parent level of the
syntactic analysis where feature matching is done again on the next
nonterminal level until root level is reached.

Very convenient is a class of verb features called subcategories; these
explain which phrases or parts of speech can follow a particular verb. Consider
“to want,” which is followed by a verb phrase with the following verb in its
infinitive form, or “to sleep,” which is intransitive, i.e. doesn’t take a direct
object at all. These restrictions are very helpful for parsing sentence syntax
because many ambiguities can be eliminated right away through this
additional knowledge. Head-lexicalized grammars make use of this mechanism
by storing the constraints of each word, e.g. for entry “to give” the
subcategory NP_NP (“to give something to somebody”) in a lexicon, and then
reduce the number of possible rewrite rules for a phrase according to the
word that is heading the phrase being analyzed. Word features are useful for
syntax analysis to such an extent that they can make standard context-free
grammars unnecessary; grammatical rules can be implemented entirely based
on features. Words and phrases that have non-contradictory features can be
combined to larger structures in a process known as unification.
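
A much reduced form of unification, restricted to flat feature sets, can be
sketched as follows. The feature names, the lexical entries, and the helper
agreement are hypothetical; real feature structures are nested and need a
recursive procedure.

```python
def unify(a, b):
    """Merge two flat feature dictionaries; return None if any feature clashes."""
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None  # contradictory values: unification fails
        result[feature] = value
    return result


def agreement(entry):
    """Keep only the features that must agree inside a noun phrase."""
    return {k: entry[k] for k in ("number", "person") if k in entry}


# Hypothetical lexical entries.
THIS = {"cat": "ART", "number": "sg"}
THESE = {"cat": "ART", "number": "pl"}
THE = {"cat": "ART"}  # unspecified for number
MICE = {"cat": "N", "number": "pl"}
```

Here unify(agreement(THIS), agreement(MICE)) fails, which is the check that
rules out the noun phrase “this mice,” whereas “these mice” and “the mice”
unify to a plural feature set.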







Logical Form and Knowledge Representation. Having examined a
text’s morphology and syntax, we are ready to extract its meaning. First,
we combine the meanings of all words in one sentence, that is, we perform
a semantic analysis, and translate them into logical form language. In
this formalism, expressions are formulated with the help of constants, i.e.
atomic meanings that describe the world (predicates representing
relations, properties, or functions, and terms such as objects and events),
and a number of quantifiers and operators modifying predicates and terms.

At this level of analysis, thematic roles are attributed to the single
semantic entities of a clause and represented in a logical form that is
independent of sentence structure. For example, “Jimi plays guitar” and “The
guitar is played by Jimi” describe the same event and are therefore considered
semantically equal. Their description in a logical form could look like this:

(PLAY (JIMI GUITAR))

meaning a predicate PLAY is defined by its two arguments JIMI and
GUITAR. Or in a more elaborate format (cf. [4.3]):

(ASSERT (<Present PLAY> p1 [AGENT <JIMI j1>]
                           [THEME <GUITAR g1>]))

indicating that the utterance is an ‘assertion’ speech act (see p. 94), and
that the predicate verb ‘play’ in its present tense form is carried out by an
agent ‘Jimi,’ acting upon theme object ‘guitar.’

The description of sentence meaning in the initial semantic interpretation
is context-independent, i.e. we are treating each sentence as a separate unit.
Consequently, nobody cares for the identity of either Jimi or his guitar. If
one of the next sentences were “His instrument is red,” we would not indicate
the anaphoric relations “his” - “Jimi” and “instrument” - “guitar” in the
semantic logic description.

To make these sentence representations accessible for contextual
interpretation, we assign identifiers to the logical form’s discourse variables
that remain valid beyond the scope of one sentence. In the example above,
variables j1 for “Jimi” and g1 for “guitar” are created and stored in a
knowledge base. This knowledge base holds the discourse objects as well as the
description of the world. Ideally, the knowledge base contains a complete
description of a system’s environment and is also equipped with a number of
inference rules that allow it to check new utterances and derive conclusions.

Entering semantic properties for words in a knowledge base is usually a 
tedious task. Consider a lexicalized verb-feature database which encodes that 
traveling can be done in a car. Of course, it can also be done in an airplane, on 
a bike, or in another kind of vehicle. Rather than entering this fact for every 
possible conveyance, we can describe the relation between different classes of 
vehicles as a semantic hierarchy, see Fig. 4.5, and postulate that traveling
can be done with a vehicle and all its descendants.

Fig. 4.5. Hierarchy of the semantic class ‘vehicle.’ Child nodes inherit features of
their mother node - and are members of the same class - and add new ones of their
own. As one traverses the word sense hierarchy from ancestor nodes to offspring,
the items become more and more specialized



Just like in object-oriented programming, child objects inherit the prop- 
erties of their parents and extend the ancestor class in some way. Besides 
inheritance, other useful relations describe that a physical object is part of 
another object, or that words are related by similar meaning. The latter rela- 
tions, since they usually form no hierarchy, are best structured as a semantic 
network in which each node can connect with each other node. 
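
The inheritance idea can be made concrete with a parent table. In the Python
sketch below, the node names and the function names are illustrative
assumptions and are not taken from Fig. 4.5.

```python
# Hypothetical fragment of a 'vehicle' hierarchy: child -> parent.
PARENT = {
    "vehicle": None,
    "car": "vehicle",
    "airplane": "vehicle",
    "bike": "vehicle",
    "convertible": "car",
}


def is_a(word, ancestor):
    """Walk up the hierarchy from `word` and report whether `ancestor` is reached."""
    while word is not None:
        if word == ancestor:
            return True
        word = PARENT.get(word)
    return False


def can_travel_with(word):
    """One entry for the class 'vehicle' covers all of its descendants."""
    return is_a(word, "vehicle")
```

A single fact stored for ‘vehicle’ thus answers the question for ‘car,’
‘convertible,’ and any node added below them later, while unrelated words such
as ‘guitar’ are rejected.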

In a language dialogue system, one usually employs different data struc- 
tures for discourse data and domain description. Consider for example an 
automatic phone banking system; here, the description of the world - the 
world of finance - would consist mainly of a huge database of bank accounts 
and a set of functions to read or manipulate the data. Discourse information
can be stored in frames, a data format that links the information extracted
from the user’s utterances to the functionality offered by the system (see the
following section on dialogue systems).

As the tools presented here illustrate, all symbolic language processing 
approaches are based on predefined rules. This has the advantage that one 
can easily grasp what a running system does, and the beauty that a system 
programmed this way is actually governed by language understanding rather 
than guesswork. 

4.1.4 Statistical Processing 

The beauty of symbolic language modeling also bears a burden: it is very
complex. A popular dictionary [4.8] proudly claims to feature more than
225000 definitions, that is, semantic and syntactic descriptions. It becomes
obvious immediately that even a small domain description, incorporating
just a tiny fraction of such a dictionary, requires a huge number of rules
which need to be manually defined. Appalled by this tedium and inspired
by machine learning techniques, computational linguists asked the valid
question: Can’t a machine just do that on its own? They decided to feed their
computers huge amounts of text, and, from this corpus-based processing (named
after the use of text corpora for training), they obtained quantities and
statistics that help to analyze and even generate language.

Such a corpus can basically consist of any set of more than one text,
but especially helpful are those in which the text is accompanied by some
form of annotation. Common tags are part-of-speech labels, word root form
(transforming plural nouns to their singular form and tensed verbs to their
infinitive in a process called lemmatization), phonetic notation, prosody,
semantics, parsing - i.e. syntax - data, and more. A landmark work still in
wide use is the Brown corpus assembled by Kučera and Francis in the early
1960s [4.9]. It consists of more than one million words of edited English prose,
divided into 500 samples of 2000+ words each, printed in the United States in
1961. Today, e-books are available for free in huge collections on the internet,
and one can even obtain whole corpora on the web for all major and some
not-so-major languages and genres such as Gaelic or the transcripts of the
O.J. Simpson trial.

The following paragraphs introduce some popular concepts, techniques, 
and applications of statistical, corpus-based language processing. 

Word Statistics. The obvious thing to do first is to take a text and count 
the words. In Oscar Wilde’s The Importance of Being Earnest, we find 20471 
symbols that qualify as words, or tokens. It would then be interesting to find 
out how diverse the writing is, how many different word types, i.e. specific 
words like “the” or “bunburyist,” appear among these 20471 tokens, and 
we observe that the book consists of 2649 types. Those follow an interesting 
distribution, see Fig. 4.6. While the ten most frequent words (note that these 
are short, one-syllable words) already account for more than 20% of the entire 
text, almost half of the types, 1277 to be exact, appear only once in the whole 
work. 

Zipf’s law states that a type’s number of occurrences is inversely
proportional to its rank in the list of words ordered by their frequency. This nicely
describes the statistics of the Wilde text, the more so if we don’t consider the 
first 10 words. It is striking to see how robustly this relation, valid indepen- 
dently of author or language, holds true for such a large scale. Zipf attributed 
this phenomenon to a fundamental human principle of least effort, but Man- 
delbrot later proved that a monkey hitting typewriter keys at random will 
also produce a text obeying Zipf’s hyperbolic law, cf. the discussion in [4.10]. 
It still remains a debated topic today whether the observed distribution is 
the consequence of a behavioral or merely a stochastic process.

Fig. 4.6. Word distribution in Oscar Wilde’s The Importance of Being Earnest.
The table on the right holds selected word types and their numbers of occurrences
in the text. On the left, occurrences are plotted as a function of their ranks on
logarithmic axes
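
The counts discussed above are easy to reproduce for any text. The Python
sketch below counts tokens, types, and words that occur only once, and lists
the product of rank and frequency, which Zipf’s law predicts to be roughly
constant. The tokenization rule (lower-cased runs of letters and apostrophes)
and the function names are assumptions, so the numbers obtained for a given
book depend on these choices and need not match the figures quoted in the text.

```python
import re
from collections import Counter


def word_statistics(text):
    """Return (number of tokens, number of types, types seen once, ranked counts)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    ranked = counts.most_common()  # [(word, frequency), ...], most frequent first
    seen_once = sum(1 for _, freq in ranked if freq == 1)
    return len(tokens), len(counts), seen_once, ranked


def zipf_products(ranked):
    """List (rank, word, frequency, rank * frequency) for each type."""
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]
```

Applied to a full-length text, plotting frequency against rank on logarithmic
axes should give an approximately straight line where the law holds.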



Sentence Probability and n-Gram Language Models. Now let’s take
a look at the probability of a sentence*. We might be motivated to do so
to compare the likelihood of speech recognition output or to decide which
sentence from a list of alternatives should be generated as a dialogue system
response. As a start we could simply compute the product of the probabilities
of all words in that utterance, but that would leave out a lot of knowledge we
have of language. Consider the sentence “The dog chewed on the old ...”
Immediately, words like “bone,” “couch,” or “shoe” come to our minds to
fill the ellipsis - the preceding words provide strong clues that lead to this
conclusion.

Using the semantic relations between neighboring words, we could
estimate a sentence’s probability by looking at each word and seeing how likely
it is to appear given the preceding words. So, to express the likelihood P(·) of
a sentence, i.e. a sequence $w_{1,m}$ of m words $w_1, w_2, \ldots, w_m$, we can
write the recursion

$P(w_{1,m}) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_{1,2}) \cdots P(w_m \mid w_{1,m-1})$    (4.1)

where $P(w_1)$ is the prior probability of word 1, $P(w_2 \mid w_1)$ the posterior
probability to encounter word 2 under the condition that the previous word was
word 1, $P(w_3 \mid w_{1,2})$ the chance to see word 3 preceded by word 1 and word
2, and so on.

* The term sentence probability, innocuous to the naive eye, has caused quite a
commotion in the speech and language research community; a fuss that humorously
exposes the deep trench between the statistical-processing school and the keepers
of the symbolic grail. Chomsky is often quoted as saying: “But it must be noticed
that the notion ‘probability of a sentence’ is an entirely useless one, under any
known interpretation of this term.” Statisticians have equal wit to offer: “Anytime
a linguist leaves the group the recognition rate goes up.” (Jelinek)

Conditional, a-posteriori probabilities appear in every branch of speech
and language systems, and very often these statistics are not directly
available. Hence, we need to estimate them using other parameters. At the
beginning of the language processing chain for instance, in speech recognition,
we are mapping speech input (acoustic signal s) to linguistic data (word w),
i.e., we are striving to find the word that maximizes $P(w \mid s)$. However, as
explained in more detail later in this chapter, we can describe this function
in a much more convenient way by looking at the probability of word w to
occur in a sentence and by comparing signal s to the signal that would be
expected were we pronouncing w. This matter can be expressed using Bayes’
rule:

$P(w \mid s) = \frac{P(s \mid w)\,P(w)}{P(s)}$    (4.2)



When deriving statistics from text, it becomes apparent straight away
that conditional probabilities as in (4.1) depending on whole phrases or even
sentences are too hard to compute even if we restrict ourselves to small
sentences. What we can do, however, is slide a window of a given number of
words, say, three, over the texts of the corpus and check whether it is
sufficient to express a word’s dependencies only by looking at its close
neighbors. The example sentence “the dog chewed on the old bone” thus would be
analyzed in chunks “the dog chewed,” “dog chewed on,” “chewed on the,” “on the
old,” and “the old bone.” This idea turns out to work comfortably well and
leads to the immensely useful n-gram structure. In this model, we formulate
the likelihood to encounter a word as a function of its n - 1 predecessors:

$P(w_m \mid w_{m-1}, w_{m-2}, \ldots, w_{m-n+1})$ .    (4.3)

Commonly, n = 3. With these trigrams, the sentence probability as written
in (4.1) is expressed as the simpler

$P(w_{1,m}) = \prod_{i=1}^{m} P(w_i \mid w_{i-1}, w_{i-2})$ ,    (4.4)

that is, we are describing a sentence as an (n - 1)th-order Markov model (see
Sect. 3.8), assuming that only the last n - 1 words affect the prediction of
the next word. These n-grams implicitly capture a language’s syntactic
structure and can therefore serve as a language model, without the application
of a handwritten, symbolic grammar.
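
To make (4.4) tangible, the following Python sketch estimates trigram
probabilities by simple relative frequencies and multiplies them along a
sentence. The padding symbols <s> and </s>, the function names, and the absence
of any smoothing for unseen trigrams are simplifying assumptions of this
sketch, not part of the model as described above.

```python
from collections import Counter


def pad(sentence):
    """Lower-case, split, and add two start symbols and one end symbol."""
    return ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]


def train_trigrams(sentences):
    """Count trigrams and the two-word histories that precede them."""
    trigrams, histories = Counter(), Counter()
    for sentence in sentences:
        words = pad(sentence)
        for i in range(2, len(words)):
            trigrams[tuple(words[i - 2:i + 1])] += 1
            histories[tuple(words[i - 2:i])] += 1
    return trigrams, histories


def sentence_probability(sentence, trigrams, histories):
    """Product of relative-frequency estimates P(w_i | w_{i-2}, w_{i-1})."""
    words = pad(sentence)
    probability = 1.0
    for i in range(2, len(words)):
        history = tuple(words[i - 2:i])
        if histories[history] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        probability *= trigrams[history + (words[i],)] / histories[history]
    return probability
```

Trained on the two sentences “the dog chewed on the old bone” and “the dog
chewed on the old shoe,” the sketch assigns each of them probability 0.5,
while a continuation that never occurred in training, such as “couch,”
receives probability 0. Practical language models avoid such zeros by
smoothing.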



Part-Of-Speech Tagging. One of the most satisfying applications of
statistical language processing in terms of accuracy is part-of-speech tagging,
i.e. finding the word category, like “noun,” of a word type like “ball.” For
English, even simple algorithms assign the proper category mark to 9 out of 10
words - sophisticated ones manage up to 95-97% - and return information
that is crucial for further analysis steps such as parsing or focus detection.
Why do we need to spend time and effort on part-of-speech tagging at all?
Consider the sentence “Flies can still fly,” whose words fall into categories
noun, auxiliary verb, adverb, verb. But a text-processing machine looking at
its dictionary doesn’t know per se whether “flies” is a plural noun or a verb,
whether “can” is an auxiliary, noun, or verb, “still” an adverb, adjective,
noun, or verb, nor if “fly” is used as verb, noun, or adjective.

Given a sentence containing words $w_{1,m} = w_1, \ldots, w_m$, we are looking for
the most likely corresponding sequence $c^*_{1,m}$ of word categories
$c_{1,m} = c_1, \ldots, c_m$, i.e., the succession

$c^*_{1,m} = \arg\max_{c_{1,m}} P(c_1, \ldots, c_m \mid w_1, \ldots, w_m)$    (4.5)

- which we cannot estimate directly. The easiest thing to do
would be to simply label each word with its most probable tag. This already
returns the proper answer for more than 90% of the words thus processed
[4.11].

Of course we can do better when we include some knowledge of language
structure. For the sentence above, “Flies can still fly,” the sequence N -
AUX - ADV - V is much more likely than the (also possible) V - N - V - N;
the rules of syntax tell us so. Using Bayes’ rule (4.2), expression (4.5)
can be reformulated as the probability of sequence $c_{1,m}$ multiplied by the
probability to encounter $w_{1,m}$ given $c_{1,m}$, divided by the a-priori
probability of $w_{1,m}$. Since the sentence probability $P(w_{1,m})$ does not
change for different $c_{1,m}$, it suffices to maximize the product of the first
two terms, and we obtain

$c^*_{1,m} = \arg\max_{c_{1,m}} P(c_1, \ldots, c_m)\,P(w_1, \ldots, w_m \mid c_1, \ldots, c_m)$ .

Just as we did in the language model (4.4), we approximate the first term,
$P(c_{1,m})$, with bigrams or trigrams (4.3). This time however, we assume that
a word category, rather than a word type, depends only on the previous
n - 1 categories, and express language structure as n-grams in the first
product in (4.6). For the second term, we ignore any dependencies
between the individual $w_i \mid c_i$ pairs and write $P(w_{1,m} \mid c_{1,m})$ as the
product of these pairs to find the desired categories as

$c^*_{1,m} = \arg\max_{c_{1,m}} \prod_{i=1}^{m} P(c_i \mid c_{i-1}, \ldots, c_{i-n+1}) \prod_{i=1}^{m} P(w_i \mid c_i)$ .    (4.6)
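
For category bigrams (n = 2), the maximization in (4.6) does not require
enumerating all category sequences; it can be carried out step by step with
dynamic programming, known as the Viterbi algorithm. The Python sketch below
illustrates this for “Flies can still fly.” All probability values are
invented for the example, and the function name and table layout are
assumptions.

```python
def viterbi(words, tags, start, trans, emit):
    """Return (probability, tag sequence) maximizing (4.6) with category bigrams."""
    # best[t] = (probability, path) of the best tag sequence ending in tag t
    best = {
        t: (start.get(t, 0.0) * emit.get((t, words[0]), 0.0), [t]) for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Best predecessor for tag t, scored by path probability * P(t | prev)
            prob, path = max(
                ((best[prev][0] * trans.get((prev, t), 0.0), best[prev][1])
                 for prev in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (prob * emit.get((t, word), 0.0), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model parameters, chosen only for illustration.
TAGS = ["N", "AUX", "ADV", "V"]
START = {"N": 0.6, "V": 0.2, "AUX": 0.1, "ADV": 0.1}  # P(c_1)
TRANS = {  # P(c_i | c_{i-1}), keyed as (previous tag, tag)
    ("N", "AUX"): 0.4, ("N", "V"): 0.4, ("N", "N"): 0.2,
    ("AUX", "ADV"): 0.3, ("AUX", "V"): 0.6, ("AUX", "N"): 0.1,
    ("ADV", "V"): 0.7, ("ADV", "N"): 0.2, ("ADV", "ADV"): 0.1,
    ("V", "N"): 0.6, ("V", "ADV"): 0.3, ("V", "V"): 0.1,
}
EMIT = {  # P(w_i | c_i), keyed as (tag, word)
    ("N", "flies"): 0.1, ("V", "flies"): 0.1,
    ("AUX", "can"): 0.8, ("N", "can"): 0.05, ("V", "can"): 0.05,
    ("ADV", "still"): 0.5, ("N", "still"): 0.02, ("V", "still"): 0.02,
    ("V", "fly"): 0.2, ("N", "fly"): 0.1,
}
```

With these made-up numbers, viterbi(["flies", "can", "still", "fly"], TAGS,
START, TRANS, EMIT) selects the sequence N - AUX - ADV - V. For long sentences
one would normally work with log probabilities to avoid numerical underflow.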

Another popular approach to part-of-speech tagging is to use hidden
Markov models (HMMs, see Sect. 3.10), which offer the advantage of being
a well-understood, intuitive representation of the matter and come with a
handy cookbook of powerful algorithmic recipes (cf. [4.12]). A common way
to model the task as an HMM is to describe word categories as hidden states,
word types as observables/output of those, and transition probabilities as
category bigrams. Often, this design is extended by using trigrams instead.

Word category n-grams are a fairly uneconomical syntax model; they can
only look back at the n - 1 preceding words, and they require substantial
memory because all n - 1 categories are stored, even if only some of them
would be necessary to store the desired information. Even worse, the related
word might not even be in the n - 1 neighborhood of the type under
investigation. A simple but clever part-of-speech tagger based on
transformations overcomes this limitation [4.13]. Initially, each word is
assigned its most likely category tag. Then, a series of substitution rules
changes a label if it appears in a trigger environment. These rules could read
“change tag X to tag Y if the previous tag was Z” or “change A to B if one of
the last three was C and the next but one is D,” and so forth. Transformation
rules offer substantially more flexibility than n-grams and can also (but need
not) be automatically generated by training on unambiguous syntactical
structures in the corpus.
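
The first kind of rule, “change tag X to tag Y if the previous tag was Z,” can
be sketched in a few lines of Python. The lexicon of most likely tags and the
single rule below are hypothetical and hand-written; they are not taken from
[4.13], where the rules are learned from a corpus.

```python
# Hypothetical table of each word's most likely tag.
MOST_LIKELY_TAG = {"jimi": "NAME", "plays": "N", "the": "ART", "guitar": "N"}

# Each rule: (old tag, new tag, required previous tag).
RULES = [("N", "V", "NAME")]


def transformation_tag(words, lexicon, rules):
    """Start from the most likely tags, then apply the rules in order."""
    tags = [lexicon[word] for word in words]
    for old, new, previous in rules:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == previous:
                tags[i] = new
    return tags
```

For “Jimi plays the guitar” the initial labeling NAME - N - ART - N is
corrected to NAME - V - ART - N, while ‘guitar’ keeps its noun tag because it
does not follow a proper name.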

Clustering. As we saw in the paragraph on semantic hierarchies, it is often 
useful to group related words, for instance words that have similar meanings 
or functions, or demand the same syntactical structure. Suppose we were 
building a language understanding unit for a dialogue system and wanted 
to code the notion of time values. We certainly would not code every pos- 
sible time explicitly in example phrases like “be there at 6:30,” “be there 
at five,” or “be there at 4:20,” but rather teach the parser a concept like 
timeOfDay and handle the phrase as “be there at <timeOfDay>.” The aim 
of clustering is to find these concepts automatically, by unsupervised corpus 
analysis, to get a better understanding of the present word types and thus 
improve language models, build thesauri or dictionaries, and improve word 
sense disambiguation. 

This process is done in two parts. First, we need to decide which word
characteristics we want to obtain and how we quantify this information so we
can compare and order values of the same parameter between all words later
on. So for each word, we create an n-dimensional vector containing the
n parameters we are looking for. After we have collected the data, we are
then concerned with grouping the data into meaningful clusters in the second
step. To do this, we define a metric that enables us to quantify how close two
words are in the n-dimensional parameter space, and a clustering algorithm
that decides how to relate the words.

Generating the word vectors - the first step - can incorporate anything 
from assembling bigrams for each word in relation with each other word, 
counting co-occurrence of words in 20-token windows, quantifying verb-noun 
relations, or finding out whether a word has more ‘e’s than the previous one 
or appears within 250 tokens from the type “Belgium;” whatever the task 
demands. 

EM Algorithm. Step two, the actual clustering, often involves more
extravagant methods. The EM (for Expectation-Maximization) algorithm for
instance is a commonly encountered mechanism. Here, we assume the data is
generated by some hidden model and fit its parameters so they best explain
the observed data. For instance, we could assume the words are distributed
over k Gaussian clusters, and then adapt means and variances of our
hidden model to best explain the observed data. Often, there is no way to find
the model parameters algebraically; therefore, the EM algorithm performs a
greedy iterative search to optimize the parameters. This is done in two steps:
Expectation - assuming we know what the model looks like, we compute the
probabilities that cluster i generated data point $w_j$. Note that words have
some nonzero probability to belong to each cluster. This way, words can
belong to more than one semantic field, i.e. are allowed more than one meaning.
Maximization - assuming we know cluster memberships, find the model
parameters - for Gaussians, the means and covariance matrices - that have the
maximum likelihood. With this iterative process, the likelihood of the model
given our data is guaranteed never to decrease; it increases or remains at a
(possibly local) maximum.
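
The two steps can be illustrated on one-dimensional data. The Python sketch
below fits k Gaussians to a list of numbers; it is a minimal illustration
under several assumptions made here: scalar data instead of word vectors,
initial means supplied by the caller, a fixed number of iterations, and plain
probabilities rather than logarithms, so points very far from every cluster
could cause numerical underflow.

```python
import math


def em_gaussians_1d(data, initial_means, iterations=50):
    """Fit a mixture of 1-D Gaussians; return (means, variances, weights)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # Expectation: probability that cluster i generated each data point.
        responsibilities = []
        for x in data:
            densities = [
                weights[i]
                * math.exp(-((x - means[i]) ** 2) / (2 * variances[i]))
                / math.sqrt(2 * math.pi * variances[i])
                for i in range(k)
            ]
            total = sum(densities)
            responsibilities.append([d / total for d in densities])
        # Maximization: re-estimate parameters from the soft memberships.
        for i in range(k):
            n_i = sum(r[i] for r in responsibilities)
            means[i] = sum(r[i] * x for r, x in zip(responsibilities, data)) / n_i
            spread = sum(
                r[i] * (x - means[i]) ** 2 for r, x in zip(responsibilities, data)
            ) / n_i
            variances[i] = max(spread, 1e-6)  # floor keeps the variance positive
            weights[i] = n_i / len(data)
    return means, variances, weights
```

For two well separated groups of values, for example around 0 and around 5,
the estimated means settle on the two group averages after a few iterations.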

k-Means. If we’re not really interested in entitling a data point to
membership in more than one cluster, we can keep the Gaussians infinitely slim
and postulate that each word belongs to just one group. This is what the
k-means algorithm does: Begin with a fixed number of clusters containing words
assembled around a cluster centroid. Assign each word to the cluster with the
closest centroid (e.g. by a Euclidean measure), then recompute each centroid
as the center of gravity of its member words, i.e. sum the cluster’s word
vectors and divide the resulting vector by the number of words in that
cluster. Repeat these two steps until you are satisfied with the grouping of
the data, for example until the assignments no longer change.
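A minimal sketch of this loop, assuming word vectors are given as tuples of numbers and the initial centroids are supplied by the caller (both are our assumptions; the stopping rule used here is "no assignment changed"):

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, centroids, max_iter=100):
    """Hard-assign every vector to its closest centroid, then recompute centroids."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every vector.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the grouping is stable
        assignment = new_assignment
        # Update step: each centroid becomes the center of gravity of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, [(0, 0), (10, 10)]))
```

Compared with EM, each vector contributes to exactly one centroid, which corresponds to the "infinitely slim" Gaussians mentioned above.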

Mutual Information. As the last clustering algorithm we discuss a method
based on the information-theoretic concept of mutual information. Let’s say
we are interested in finding words that appear in the same context. Brown
et al. [4.14] used mutual information as a measure of how word types and
the vectors representing them are correlated; a concept that examines how
much information one variable gives us about another one. In general, the
mutual information $I(w_1, \dots, w_n)$ between n variables describes how much
more information is contained in the sum of the separate variables than in
the combined distribution of all $w_i$ - i.e. how much higher the sum of all
entropies is than the entropy of the entire distribution (see the footnote on
entropy below). The point-wise mutual information between two specific word
types in a corpus can be written as [4.11]:

$$I(w_1; w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

which leads to the idea of average mutual information of two random variables
$W, W'$,

$$I(W; W') = \sum_{i=1}^{m} \sum_{k=1}^{m} P(w_i, w'_k) \log \frac{P(w_i, w'_k)}{P(w_i)\,P(w'_k)}$$

that expresses how much mutual information, on average, two words $w, w'$
in clusters $\mathcal{W}, \mathcal{W}'$ of m words have. In this case, we are
interested in the minimal loss of average mutual information as new word
vectors are added to one cluster, so $\mathcal{W} = \mathcal{W}'$.

Footnote (entropy): In information theory, a variable’s entropy $H(w)$ can be
understood as a measure of the amount of information given by that observable,
and the mutual information of n elements can be expressed as
$I(w_1, \dots, w_n) = \sum_{i=1}^{n} H(w_i) - H(w_1, \dots, w_n)$.
One can think of entropy as the observation’s minimal code length. If the $w_i$
are independent, the number of bits needed to represent $w = w_1, \dots, w_n$,
i.e. all $w_i$ in their entirety, is simply the sum of the required bits to
encode the discrete $w_i$. If they have mutual information, these common bits
can be reused and the code for the combined set w becomes smaller.
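Both quantities can be estimated from co-occurrence counts. The sketch below assumes the corpus has already been reduced to a list of word pairs (for instance adjacent words), uses relative frequencies as probability estimates, and takes logarithms to base 2 so that the result is in bits; none of these choices is prescribed by the text.

```python
import math
from collections import Counter


def pointwise_mi(pairs):
    """Return {(w1, w2): point-wise mutual information in bits}."""
    pair_counts = Counter(pairs)
    left_counts = Counter(w1 for w1, _ in pairs)
    right_counts = Counter(w2 for _, w2 in pairs)
    n = len(pairs)
    pmi = {}
    for (w1, w2), count in pair_counts.items():
        p_joint = count / n
        p_independent = (left_counts[w1] / n) * (right_counts[w2] / n)
        pmi[(w1, w2)] = math.log2(p_joint / p_independent)
    return pmi


def average_mi(pairs):
    """Average mutual information: the point-wise values weighted by P(w1, w2)."""
    pair_counts = Counter(pairs)
    n = len(pairs)
    pmi = pointwise_mi(pairs)
    return sum((count / n) * pmi[pair] for pair, count in pair_counts.items())


if __name__ == "__main__":
    dependent = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
    independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    print(average_mi(dependent), average_mi(independent))
```

Pairs that never co-occur are simply skipped, which matches the usual convention that a term with zero joint probability contributes nothing to the average.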

Brown et al. proceed as follows: Starting with a large number of clusters,
if computational power allows, one cluster per word, the average mutual 
information for adjacent classes is evaluated. Then those two clusters whose 
combination will result in the smallest loss of average mutual information are 
merged. Once the number of classes is acceptably low, each word is moved 
to the cluster for which the resulting class has the greatest average mutual 
information. 



Probabilistic Context-Free Grammars. Context-free grammars (CFGs) 
as introduced earlier in this chapter are a handy way to structure a sentence, 
see Fig. 4.2. Often, a grammar associates more than one proper structure to 
a phrase, especially if a large number of rewrite rules to extend a node exist. 
Consider the sentence “Bob saw the man with binoculars,” possibly meaning 
that “with binoculars” describes the man Bob saw. Alternatively, it could be 
interpreted as an adverbial phrase determining that Bob saw the man through 
binoculars, leading to a different parse tree. In this case, a sentence that is 
both syntactically and semantically correct offers two different meanings. 
Since CFGs do not consider the actual words, i.e. sentence semantics, but 
merely word categories such as N, V, ADJ and so forth, they often generate 
a vast number of syntactically valid structures for the same phrase, especially 
if its words are members of multiple categories. To illustrate this, just try to 
find all grammatically (not necessarily semantically) valid parse trees for the 
phrase 



a (ART V N V PREP V AUX) % ( VV N V ADJ) likeS (y v N) Ai^S ( VV N) • 

Footnote (average mutual information): Equivalent to the maximum mutual
information approach, the number of bits necessary to code all elements in a
cluster, as a measure of information and therefore entropy, can be used to
evaluate how well the elements of a class fit together: the shortest possible
description in the optimal code guarantees maximum mutual information. If we
now also find a code to describe the group of all clusters and look for the
separation into classes that yields the shortest description, we have a method
to optimize both the number of clusters and the distribution of elements into
these classes at the same time. This concept is known as the minimum
description length method.

In cases where we have to choose between multiple grammatically correct
structures, a legitimate guess would be to pick the one with the highest
probability. To find it, we use regular context-free grammars and now, for
each rewrite rule, remember how likely that rule is to be used. We need:

- a set of terminal symbols (words) $a^i$, $i = 1, \dots, L$

- a set of nonterminal symbols $A^j$, $j = 1, \dots, n$

- a start symbol

- rewrite rules that map nonterminals to a string $\gamma$ of terminals and
nonterminals: $A^j \to \gamma^k$

- rewrite probabilities satisfying the condition that the probabilities of all
rules that expand the same nonterminal sum up to 1:
$\sum_k P(A^j \to \gamma^k) = 1 \quad \forall j$

These elements form a probabilistic context-free grammar (PCFG). 

Again, we learn the required statistics by looking at a large number of 
examples, in this case parse trees just like the one in Fig. 4.2. We find these 
syntax-annotated sentences in a treebank such as the commonly used Penn
Treebank [4.15]. These statistics allow a parser to assess how likely a struc- 
ture is to produce a given string: the sentence/tree probability is simply the 
product of all rewrite-rule probabilities in the tree. 
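To make the product rule concrete, the sketch below scores the two readings of “Bob saw the man with binoculars.” The toy grammar and all its probabilities are invented for illustration (they are not taken from any treebank), and the nested-tuple tree format is our own convention.

```python
# Hypothetical rule probabilities; rules expanding the same nonterminal sum to 1.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Bob",)): 0.3,
    ("NP", ("ART", "N")): 0.4,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("binoculars",)): 0.1,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("PP", ("PREP", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("ART", ("the",)): 1.0,
    ("N", ("man",)): 1.0,
    ("PREP", ("with",)): 1.0,
}


def tree_probability(tree, rule_probs):
    """Product of the probabilities of all rewrite rules used in the tree.

    A tree is a tuple (label, child, ...); a bare string is a word.
    """
    if isinstance(tree, str):  # a word itself applies no rule
        return 1.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rule_probs[(label, rhs)]
    for child in children:
        prob *= tree_probability(child, rule_probs)
    return prob


PP = ("PP", ("PREP", "with"), ("NP", "binoculars"))
THE_MAN = ("NP", ("ART", "the"), ("N", "man"))
# Reading 1: "with binoculars" describes the man.
NOUN_ATTACHMENT = ("S", ("NP", "Bob"), ("VP", ("V", "saw"), ("NP", THE_MAN, PP)))
# Reading 2: Bob uses the binoculars to see the man.
VERB_ATTACHMENT = ("S", ("NP", "Bob"), ("VP", ("V", "saw"), THE_MAN, PP))

if __name__ == "__main__":
    for name, tree in [("noun", NOUN_ATTACHMENT), ("verb", VERB_ATTACHMENT)]:
        print(name, tree_probability(tree, RULE_PROBS))
```

With these made-up numbers a parser would prefer the verb attachment, simply because that tree uses one rule fewer and a more probable expansion.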

A little care is in order as these probability values are considered. Looking 
at the way the probabilities are attributed to rewrite rules, it is apparent that 
nonterminals that generate only a small number k of strings will usually be 
favored since the strings thus produced will have relatively high probabilities. 
Also, trees with small numbers of nodes will, on average, receive a higher
score, since every additional rewrite rule multiplies the tree probability by
a factor of at most 1 and can therefore never increase it.

While a CFG only allows the binary answer whether a sentence can be 
produced given the grammar or not, its probabilistic extension can also assess 
how likely a sentence is (as the sum of all possible parse tree probabilities), 
and therefore serve as a probabilistic language model. However, since a PCFG 
considers only the syntactical structure and knows nothing about semantics, 
it usually performs worse than an n-gram model. 

One way to induce a good bit of further language capacity is to lexicalize 
the grammar: when the structure statistics are gathered from the treebank, 
each terminal and non-terminal symbol is now also characterized by its head 
word, and likewise, each node in a parse tree is annotated by its head. With 
these head-lexicalized context-free grammars, we are able to consider feature 
structures such as the aforementioned verb sub-categories. If we see a verb 
phrase headed by “to hand,” the head verb is usually followed by two noun 
phrases, and therefore rewrite rule 

VP(head: hand) → V(= hand) NP(head: x) NP(head: y)

is much more likely than other ones expanding this node. Once this knowledge
is included, unlikely structures can be ruled out and the parse tree forest
disambiguated. The relation of head word and its dependents is also used
in dependency grammars, cf. [4.16]. In these, the parse consists entirely of
the hierarchical connection of words. Each dependent word is connected to
its head which in turn is connected to its head again until the sentence head
word is reached.

A range of more powerful rewrite grammars as well as unification gram- 
mars and their probabilistic - weighted if you will - extensions exist that 
incorporate more than purely syntactic information. Key issues are the inte- 
gration of lexical information, morphology, feature structure, sentence func- 
tions (such as subject, object), and so forth, see e.g. [4.6]. 

Neither a 100% statistical representation nor a purely symbolic descrip- 
tion would be able to capture language in its entirety. Statistical methods 
are usually easier to implement and their applications often yield better re- 
sults than symbolic systems, see for instance the comparison of different ap- 
proaches to machine translation in the Verbmobil project [4.17]. However, no 
one would really claim that an n-gram model actually achieves anything like 
a deep language understanding. 

Symbolic grammars come closer to this goal; some researchers even believe
that language is represented in this very way in our minds. Even if this were
true, there would certainly be a statistical concept firmly embedded in it.
In practice, speech and language processing applications benefit from the
combination of symbolic and statistical techniques, just as we have seen with
head-lexicalized probabilistic context-free grammars. The next part outlines
how these methods can be put to use in the field of speech dialogue systems.



4.2 Speech Dialogue Systems 

Speech applications available today implement the signal processing, lan- 
guage analysis, and knowledge representation techniques presented in this 
book to create human-machine interfaces that allow a speaker to interact 
with a computer in a natural manner. Speech dialogue systems, also referred 
to as spoken dialogue systems, have to meet the user’s demands for con- 
text interpretation, negotiation ability, flexibility of language and interaction, 
production of cooperative responses, adequacy of response style, user
adaptation, and much more. The following sections give an overview of
fundamental concepts in natural language speech dialogue systems and outline
common architectures and their functional elements.







4.2.1 Demands of a Dialogue System 

The ultimate goal is to build a dialogue system that is helpful and com- 
municates in a natural manner, as naturally as a human dialogue partner 
would. 

Natural Dialogue. In 1950, Turing, considering the question, “Can machines 
think?” [4.18] concocted an experiment he called the imitation game in which 
an interrogator (I) communicates with two other players (X, Y) located in a 
different room, a man and a woman. The interrogator tries to find out by ask- 
ing questions whether X is the male or the female player. All communication 
is entirely verbal, i.e. only text is exchanged, and I doesn’t hear or see the 
other players nor has he any other source of information. X and Y need not 
respond truthfully; their goal is to cause I to make the wrong assumptions. 
Turing argued that if a machine takes the part of X or Y in this game and
performs as well as a human player, i.e. if the machine’s responses are
indistinguishable from those of a human, it is safe to deem the machine
thinking. This criterion is now commonly called the Turing test.

Not much later programs appeared that achieved a considerable degree 
of naturalness with a very limited extent of artificial intelligence, entered as 
hardcoded rules. One of the first and most famous systems was Eliza, de- 
veloped by Weizenbaum in 1966 [4.19]. Designed to model a psychotherapist 
who is encouraging a patient to talk about himself, Eliza picks up certain key- 
words and sentence patterns, reformulates input phrases and has a number 
of default questions to generate dialogues like the following:

Eliza - Hello, how are you? 

Human - I got the blues. I lost my job, lost my money, and now my wife 
took off with my truck. 

Eliza - Who else in your family took off with your truck? 

Human - She took them kids, too. And the darn dog. 

Eliza - And what do you think does that suggest? 

Human - I guess that means I am bummed. 

Eliza - So you don’t like being bummed, do you? 

Human - Are you trying to cheer me up?? 

Eliza - Do you fantasize about me trying to cheer you up ? 
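The mechanics behind such a dialogue can be imitated with a handful of patterns. The sketch below is not Weizenbaum's program; the rules, the pronoun table and the default question are hypothetical and were chosen only to reproduce a few turns of the exchange above.

```python
import re

# Swap first and second person so that input phrases can be echoed back.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (pattern, response template) pairs, tried in order.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE),
     "So you don't like being {0}, do you?"),
    (re.compile(r"\bare you (.+)", re.IGNORECASE),
     "Do you fantasize about me {0}?"),
]

# Used when no keyword or sentence pattern is found.
DEFAULT = "And what do you think does that suggest?"


def reflect(fragment):
    """Lower-case a captured phrase, drop final punctuation, swap pronouns."""
    words = fragment.lower().rstrip(".!?").split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance):
    """Reformulate the first matching pattern or fall back to the default."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT


if __name__ == "__main__":
    print(respond("I guess that means I am bummed."))
    print(respond("Are you trying to cheer me up??"))
```

The program has no notion of what “bummed” means; it merely copies the matched phrase into a canned frame, which is exactly why so little artificial intelligence suffices for a natural-sounding turn.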



Today, computer chatterbots, many of which - such as Eliza - are available 
online and for download on the Internet, are able to carry on a conversation 
in an astonishingly natural way by mimicking the mechanics of a human-to- 
human conversation. 

Effective Dialogue. To be of practical use, however, a dialogue system needs
to accomplish more than simply adopt a natural guise. We want to cooperate
with a system to achieve a specific goal. A communicating agent, human or
machine, who wishes to engage in a cooperative dialogue needs to embody
a number of principles defining the system’s processes and cognitive state.
Think of a call center agent who is asked to look up a phone number. First of
all, the agent needs perception to realize the desired task. He has a database
of names and corresponding telephone numbers and knows how to handle user
requests. This knowledge forms his belief. The agent wants to help (desire)
and thus decides to assist the person (planning, reasoning). To do so, he
forms the intention to look up the name of the desired person and tell the
caller the number. He needs commitment to carry out his intended task, and
then acts accordingly: he looks up the number and returns it to the caller.

This belief/desire/intention (BDI) model likewise forms the foundation
for a dialogue system. It receives or perceives input in the form of a user 
request and has a database attached to it that describes the part of the 
world the machine needs to know to carry out its tasks. It had better be 
programmed to be willing and committed to help, and when it sees it can 
handle a request, it should plan to do so. If everything runs smoothly, it will 
then come up with a strategy to meet the demand and act accordingly. 

Cooperative Dialogue. As outlined in the first part of this chapter, perception 
requires analysis on multiple levels. Peter Sellers, playing Inspector Clouseau 
in the comedy “The Pink Panther,” asks a local in the streets of a Swiss city: 
“Do you know the way to the Palace Hotel?” The man answers
“Yes.” and walks away. Clearly, to make use of the perceived information,
it is also necessary to model the beliefs of the communication partner and
understand his/her intentions. Grice summarizes the rules of a cooperative
dialogue in four maxims [4.20]:

- Quantity: communicate as much information as useful, not more 

- Quality: make your contribution one that is true 

- Relation: the information given should be related to the discourse context 

- Manner: expressions should be clear, brief, and unambiguous. 

Communication Modalities. In most cases, a dialogue system provides the 
mapping from user input to the underlying functionality. (The latter is de- 
scribed in the system’s activity model.) Consider a simple car navigation 
system that guides a driver to a specified destination. In this case, the devel- 
oper needs to design an interface that allows the user to enter the names of 
a city, street, street number, specify route options, and start the navigation. 
Using speech as the input modality allows the driver to keep his hands on 
the wheel and eyes on the road while entering the destination, so the pro- 
grammer might opt to build a speech user interface to map user utterances 
to the navigation engine’s functions. 

While spoken input offers the advantage that many pieces of information
needed to start an application can be entered at once, e.g. “drive me to 742
Evergreen Terrace, Springfield” rather than clumsily entering the address
with a dial-and-push button, it might be useful to offer the tactile interface
as a backup when the automatic speech recognition performs poorly. On the
output side, using speech commands like “at the next intersection, make a
left” to guide the driver is safer than having him look at a screen, but when
defining a destination address, a display showing a map can help to visualize
the entered destination. In this case, the programmer may decide to build a
dialogue system using auditory, visual, and haptic perception and needs to
come up with a design that integrates all these modalities into one
framework. Such interfaces are called multimodal interfaces.



4.2.2 Architecture and Components 

Very few institutes or companies program all parts of a dialogue system them- 
selves nowadays. In most cases, off-the-shelf products such as commercially 
available speech recognition and text-to-speech engines are combined with 
one or more self-written system parts to build the desired application. This 
imposes two strong demands on system design: firstly, we require a modu- 
lar architecture that clearly separates all components, and secondly, we need 
well-defined interfaces between functional parts. 

Current natural language speech dialogue systems usually contain the fol- 
lowing chain of components: speech recognition, natural language processing, 
discourse engine, response generation, and speech synthesis. This modular 
design leads to a centralized structure (cf. for instance the Galaxy/DARPA 
communicator’s hub-and-spokes layout [4.23]) in which components register 
at a central system manager. The following sections outline how the infor- 
mation is processed on the way through these modules. 



4.2.3 How to Wreck a Nice Beach 



or did you say “how to recognize speech?” The front end of a natural language
speech dialogue system is the automatic speech recognition (ASR) unit. It
is occupied with the veritable challenge to convert the incoming signal into
text, finding, out of all word strings that would sound similar to the observed
acoustic evidence, the one representing the actual utterance. That is, given
the acoustic signal S, the task is to find the word sequence W that maximizes
the probability to encounter W given S: $P(W|S)$. Since this probability is
hard to obtain, Bayes’ rule (4.2) is used to rewrite

$$P(W|S) = \frac{P(S|W)\,P(W)}{P(S)}$$

and since the denominator $P(S)$ remains the same for all word sequences W,
this factor can be ignored and we now strive to find the one string $\hat{W}$
that maximizes $P(W|S)$, i.e.

$$\hat{W} = \arg\max_{W} P(W)\,P(S|W) , \qquad (4.7)$$

where $P(W)$, the probability to encounter word sequence W, can be obtained
from the ASR’s language models and $P(S|W)$, “given sequence W, what’s the
chance it sounds like S?” is estimated by the acoustic model. This is quite a
challenge indeed, considering that it possibly needs to work with a variety of
different speakers, uttering continuous, difficult to segment speech, in noisy
environments, using a large vocabulary.

Fig. 4.7. A prototypical natural language speech dialogue system. First, the user
utterance is turned from speech signal into text by the automatic speech recognition.
The natural language processing unit parses the utterance and extracts its meaning.
The discourse engine decides which dialogue step to take, e.g. negotiate alternatives,
disambiguate input, or confirm a hypothesis. (Often, this part of the system is called
dialogue manager; however, some authors refer to the whole system as dialogue
manager. We shall identify this component as discourse engine to avoid confusion.)
The chosen discourse action is the basis for the system answer which is generated
by the response generation module. Its output, a text string, is converted into an
audible response by the speech synthesis. After each turn, the gained information
is added to the information state

A popular method for small-vocabulary, speaker-dependent speech recognition
- for instance for mobile phone ASR applications - is dynamic time
warping. The user records 10-50 spoken commands and stores them in the 
phone’s memory. Each time he or she now articulates a voice command, the 
speech recognizer stretches and crunches - i.e. warps - the incoming signal 
to see if it looks similar to anything it has stored. 
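The warping idea can be sketched with the classic dynamic-programming recursion. For brevity we assume that every time slice is described by a single number rather than a feature vector, and the stored command names and values are invented.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of the first i frames of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Repeat a frame of b, repeat a frame of a, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(signal, templates):
    """Return the name of the stored template that warps onto the signal most cheaply."""
    return min(templates, key=lambda name: dtw_distance(signal, templates[name]))


if __name__ == "__main__":
    stored = {"call": [1, 3, 5, 3, 1], "stop": [5, 5, 1, 1, 5]}
    # The same contour as "call", spoken more slowly.
    print(recognize([1, 1, 3, 5, 5, 3, 1], stored))
```

A slower or faster rendition of the same command yields a small distance because frames may be repeated at no extra cost beyond their local mismatch; a practical recognizer would additionally reject inputs whose best distance is still large.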

For large-vocabulary, speaker-independent speech recognition, this paradigm
won’t do. In this case, hidden-Markov-model methods as illustrated
in Sect. 3.10 are used. In a nutshell, the process works like this: First, the
speech signal is divided into manageable time slices of, say, 10-20 ms length.
This data is turned into a representation that yields meaningful features for
pattern recognition. Commonly, the signal is transformed into the frequency
domain and expressed as feature vectors (typically 8-30 dimensions) in terms
of their mel-frequency cepstral coefficients, linear prediction coefficients,
or a mixture of Gaussians. Since we have a finite set of acoustic models
(often 100-200, usually hidden Markov models), each continuous, real-numbered
feature vector is mapped to its best-fitting acoustic model, thereby
clustering, i.e. quantizing the vectors. From these acoustic models, possible
output words are generated by matching the recognized acoustic evidence to the
phonetic descriptions of the words in the lexicon. The resulting word
sequences are evaluated by a language model (e.g. a bigram model) to see how
probable the word strings are in the given language, and then output either as
best sentence or, producing more than just the most likely word sequence, as
word hypothesis graphs, see Fig. 4.8.




Fig. 4.8. A word hypothesis graph returned by the speech recognition component 
for the utterance “how to recognize speech.” The correct word string was recognized 
as best sentence, alternative hypotheses are also returned. Other possible sentences 
that can be formed from this lattice are for instance “how to wreck a nice beach” 
or “how do we cognize peach.” Each node marks one point in time, so all words 
that begin after the same node are temporally aligned. Not shown here are the 
annotated word confidences and transition probabilities 



The workhorse of speech recognition is the hidden Markov model (HMM)
discussed in Chap. 3, see also [4.12]. It serves as the standard
phonetic/acoustic model. To build these (cf. [4.24]), one decides which set of
phonetic symbols to use, for example the IPA or SAMPA alphabets mentioned
earlier in this chapter. Each symbol in the phonetic alphabet is described by
one elementary HMM. To phonetically characterize a word, one then concatenates
all these elementary building blocks corresponding to the phonetic description
of that word to form one larger HMM that now describes the whole word.

As noted on page 69, a single phoneme can sound different, having multiple
phones associated with it. These allophones of the same phoneme usually
depend on the context, that is, the phonemes just before and after it.
Similar to trigram language models, a standard way to describe phonetic context
is to use triphones as an acoustic model.
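As a small illustration, a word's phoneme string can be expanded into context-dependent units as follows. The `left-phone+right` label format and the `sil` boundary symbol are notational assumptions of this sketch, not something the text prescribes.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into one context-dependent unit per phoneme."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    # SAMPA-style transcription of "speech"
    print(to_triphones(["s", "p", "i:", "tS"]))
    # prints ['sil-s+p', 's-p+i:', 'p-i:+tS', 'i:-tS+sil']
```

Each resulting label would then select its own elementary HMM, so the word model is still a concatenation of building blocks, only now with one block per phoneme-in-context.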
4.2.4 Natural Language Processing

Once the speech recognition part has delivered a string of words, the next 
challenge is to extract the utterance’s meaning. This is the task of the natural 
language processing (NLP) unit, also called natural language understanding 
component, or NLU, for short, and as the name suggests, this is the module 
in which the concepts described in the first part of this chapter are brought 
to bear. 

Usually, the NLU processing starts off with some sort of parsing to find 
the structure of an utterance. Figure 4.9 shows a chart parsing algorithm 
[4.3] that applies a context-free grammar to the sentence “The guys can 
play soccer.” It correctly identifies “The guys” as noun phrase and “can play 
soccer” as a verb phrase consisting of an auxiliary verb, a verb and its direct 
object. However, the three different functions of “can” - auxiliary verb, verb, 
and noun - and the two classes of which “play” is a member, noun and verb, 
create unintended symbols like VP: “can” V “play” N that eventually lead 
to a second sentence structure besides the one desired. It becomes apparent 
straight away that if a very simple grammar like the one in this example 
returns multiple interpretations, we will be lost in a whole forest of possible 
syntax trees when we use a more natural, complex set of rewrite rules. 

Valid and proven ways out of this dilemma are through the employment 
of feature structures, statistics, or lexicalization as outlined earlier in this 
chapter, but these do not entirely eliminate all parsing problems. Speech 
recognition systems are far from perfect, and with many speech dialogue 
applications working in acoustically adverse conditions like in a car or over 
the phone, word error rates can be arbitrarily high and will play havoc with 
any parser expecting perfectly correct sentences. Even if speech recognition 
were to yield a 100% accurate transcription of an utterance, the ungram- 
maticalities of spontaneous speech would still, more often than not, prohibit 
structuring an entire sentence from start symbol to terminals. 

Fortunately, this is not necessary for many applications. It often suffices 
to find well-behaved clusters of adjacent words in a sentence, so-called islands 
of confidence, and parse these as far as possible. If the information necessary 
for the dialogue is found in these regions, chances are an island parser can 
still extract this information in a useful way. An example of an algorithm 
that can handle such fragmentary input is the chunk parser (cf. [4.25]). It is 
made up of several layers of finite state grammar rules that combine identified 
symbols to the next higher-level symbols from bottom - terminal symbols,
word categories - to top - S, sentence start symbol. The grammar doesn’t
allow recursive rewrite rules; at each step, the parse propagates to the next
level. Symbols that don’t fit are simply left unchanged. In this manner,
regular structures can be analyzed without the inconsistent parts disturbing
the process too much.
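A toy version of such a layered parser might look as follows. The lexicon, the three rule layers and the `?` symbol for unknown words are hypothetical, and each word is given a single category, which sidesteps the ambiguity of “can” and “play” discussed above.

```python
LEXICON = {"the": "ART", "guys": "N", "can": "AUX", "play": "V", "soccer": "N"}

# One list of non-recursive rules per layer, applied bottom-up;
# longer right-hand sides are listed first so that they win.
LAYERS = [
    [("NP", ("ART", "N")), ("NP", ("N",))],
    [("VP", ("AUX", "V", "NP")), ("VP", ("V", "NP"))],
    [("S", ("NP", "VP"))],
]


def apply_layer(symbols, rules):
    """Scan left to right, replacing matched spans; unmatched symbols are kept."""
    out, i = [], 0
    while i < len(symbols):
        for lhs, rhs in rules:
            if tuple(symbols[i:i + len(rhs)]) == rhs:
                out.append(lhs)
                i += len(rhs)
                break
        else:  # no rule of this layer starts here
            out.append(symbols[i])
            i += 1
    return out


def chunk_parse(words):
    """Map words to categories, then pass the symbols through every layer once."""
    symbols = [LEXICON.get(word, "?") for word in words]
    for rules in LAYERS:
        symbols = apply_layer(symbols, rules)
    return symbols


if __name__ == "__main__":
    print(chunk_parse("the guys can play soccer".split()))  # prints ['S']
    print(chunk_parse("uh the guys can play xyz".split()))
    # prints ['?', 'NP', 'AUX', 'V', '?']
```

In the second call the misrecognized words prevent a full sentence parse, yet the noun phrase “the guys” still surfaces as an island that later components can use.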

While this method operates strictly bottom-up, i.e. from terminals to start
symbol, this is not inherent to all parsers. Tree-adjoining grammars (TAGs,
see [4.26], and [4.27] for a lexicalized version) operate like a tree building
tool kit, major assembly required. The analysis begins with a basic initial
tree structure such as S → NP VP. The parser knows a set of these basic trees
and has rules to expand them by adjoining, that is, inserting, other simple
sub-trees such as NP → ART NP to form more complex trees - here,
S → (NP → ART NP) VP - until the final syntax structure of the sentence has
been found.

Fig. 4.9. The chart parsing algorithm, applied to the sentence “The guys can play
soccer.” At step 1, the first token, “the,” is looked up in the lexicon and recognized
as an article, and the symbol ART is noted above the word. The parser’s grammar
furnishes one rewrite rule whose right-hand side begins with an article, NP → ART
N. This rule activates an open arc in the parser’s agenda, here displayed as an
arrow. The dot in ART ∘ N indicates that the symbols to its left, here, ART, have
been identified in the text string, while the ones to the right of the dot still need
to be found for the rule to be completed. At step 2, “guys” is added and identified
as noun, thus the NP rule is successfully traversed and an NP symbol added to the
chart above the text. The noun phrase itself is the first symbol on the right side of
the grammar’s S → NP VP rewrite rule, so an active arc for this rule is added to the
agenda. This algorithm of matching rules to words in the agenda and writing the
symbols corresponding to completed rules to the chart is repeated until all words
have been processed or no matches can be identified

In many cases, it is even sufficient to leave out syntactic analysis alto- 
gether and merely look at interesting keywords and their meaning to build 
the semantic representation of an utterance [4.21]. This is the case in most 
frame/slot-filling applications (see p. 97), especially if the initiative lies with 
the system. If a flight information dialogue engine asked a user with which
airline he would like to travel and the answer contained the word “Lufthansa,”
even if the ASR has garbled the text string to a point that parsing is impos- 
sible, accepting the word with matching semantic type to fill a slot is usually 
a comfortably good guess. 

Of course, the user could have said “oh just any one, but not Lufthansa,”
but most of these exceptional answers can be correctly interpreted if a small
number of rules are added, for instance, to catch negations. These rules can
be written as flat, i.e. non-rewriting mini-grammars, containing information
of the kind ‘if a “not” precedes a semantic item, give it negative evidence.’
These sub-grammars are also useful to capture the different possible forms of
concepts like time phrases - five o’clock, five p.m., seventeen hundred hours,
ten after four, etc. - dates, prices, and so forth. Rules of this kind are often
expressed as syntactic/semantic templates or regular expressions.
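Keyword spotting with one flat negation rule can be sketched in a few lines. The airline list and the set of negation words are hypothetical, and the rule only looks at the single word immediately before the keyword.

```python
import re

AIRLINES = {"lufthansa", "delta", "united"}  # hypothetical slot vocabulary
NEGATIONS = {"not", "no", "except"}          # hypothetical negation cues


def fill_airline_slot(utterance):
    """Return (accepted, rejected) airline names found by keyword spotting."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    accepted, rejected = set(), set()
    for i, token in enumerate(tokens):
        if token in AIRLINES:
            # Flat rule: a negation directly before the item gives negative evidence.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                rejected.add(token)
            else:
                accepted.add(token)
    return accepted, rejected


if __name__ == "__main__":
    print(fill_airline_slot("uh lufthansa please"))
    print(fill_airline_slot("oh just any one, but not Lufthansa"))
```

No parse is attempted at all, so the function still returns a usable slot value when the words surrounding the keyword have been garbled by the recognizer.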

The utterance just considered was a response to a system query. Other
types of utterances that can be expected in a dialogue system are user
questions, commands, confirmation to a system question, and so on. These
different kinds of expression are distinguished as different speech acts, or
more precisely, illocutionary acts [4.28] (as opposed to perlocutionary acts
that describe the production of an effect in the addressee of the utterance). Their
recognition has an impact on all following components: first of all, it tells the 
discourse engine what kind of dialogue step to plan, for instance, whether it 
should answer a question or query the user for a missing slot; the response 
generation will form its answer accordingly, e.g. as an informative answer, 
question, confirmation; and the text-to-speech module can adapt its prosody 
in agreement with the system’s dialogue act. 

Once the parsing, i.e. the structural analysis, has been done, the extracted
data needs to be investigated with respect to its meaning and understood
in the context of the dialogue. Natural language offers some challenges to
this interpretation task. To name just two, we have to deal with anaphora,
or reference resolution, and with ellipses. When we refer to a discourse
entity, say, a flight destination city, we do not necessarily mention it again
explicitly: “There’s a flight to San Diego at 6 p.m.” “OK, I want to go there.”
To understand the anaphoric “there,” we must investigate antecedent items in
this context, locations from previous utterances that could potentially be
referenced here. A possible next user utterance could be “And then to Cabo San
Lucas,” where the complete sentence would read “And then, I want to fly to Cabo
San Lucas.” Still, the sentence is perfectly understandable because we are
used to speaking using ellipses, that is, omitting a word or phrase that is
not required to support the intended statement. In general, we only express
new information that modifies already agreed-upon, grounded facts and do
not restate established knowledge in each turn (see [4.29]). The contextual
information part of the NLU therefore also needs to assign constants to mark
objects like a particular destination city “city 1: San Diego” to be able to
refer to it in later utterances. To keep track of this, a system needs memory
to store established data, i.e. the context of the conversation. The collection
of all stored information is often referred to as the system’s information state.
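A very reduced information state with reference resolution could be sketched like this. The class, the anaphor table and the "most recent entity of the right semantic type" heuristic are our simplifying assumptions; real systems weigh several candidate antecedents.

```python
# Hypothetical mapping from anaphoric words to the semantic type they refer to.
ANAPHORA = {"there": "location", "then": "time"}


class InformationState:
    """Stores grounded discourse entities, most recent last."""

    def __init__(self):
        self.entities = []  # list of (semantic_type, value) pairs

    def ground(self, semantic_type, value):
        """Record an entity that has been established in the dialogue."""
        self.entities.append((semantic_type, value))

    def resolve(self, word):
        """Return the most recent entity the anaphoric word may refer to, or None."""
        wanted = ANAPHORA.get(word.lower())
        if wanted is None:
            return None
        for semantic_type, value in reversed(self.entities):
            if semantic_type == wanted:
                return value
        return None


if __name__ == "__main__":
    state = InformationState()
    state.ground("location", "San Diego")  # "There's a flight to San Diego ..."
    state.ground("time", "6 p.m.")
    print(state.resolve("there"))          # "OK, I want to go there."
```

Storing the semantic type with every entity is what lets “there” skip over the more recently mentioned departure time and pick the city instead.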

The final semantic representation can then be put in the context of previous
utterances and expressed either as a frame (see below), in logical form (if
it is a purely semantic account), quasi-logical form (a similar format including 
multiple, ambiguous readings, for instance word sense or syntactic/scoping 
ambiguities, as well as contextual constraints, in the same expression), or in 
another system-specific format. 

4.2.5 Discourse Engine 

This module plans the actual dialogue strategy. It is this part of the system 
that is responsible for a natural and efficient dialogue. Evaluating the infor- 
mation state, the context, and the current utterance, it decides whether it 
should for instance confirm something the system has understood but is not 
certain about, disambiguate two possible actions, give an answer to the user, 
and so forth. 

To illustrate the idea of a dialogue strategy, think of a situation in which 
human ‘speech recognition performance’ would worsen, e.g. at a noisy bus 
station. If one were not sure what someone else had just said, most likely, 
one would start a line of questions to clarify uncertain points. Similarly, if 
a system’s ASR unit delivers text with shaky word confidences, different 
strategies depending on the quality of the data would be applicable to obtain 
missing or unsure items. To some extent, this can even remedy the effects of 
a poorly performing speech recognition engine. 

State-Based Systems. In the following paragraphs, let’s consider how a
specific dialogue can be defined. We’ll start with a simple method using
finite state automata as introduced earlier. These structures can be employed
to work through a dialogue in a step-by-step fashion. In the example below,
the user is asked to enter an address for a navigation task. At first, the
NAVI finite state machine is invoked by the command to begin a navigation
dialogue and starts in the initialization state. It prompts the user for the
name of a city. Once that is entered, it proceeds to state NAVI 1 and asks
for a street. Given the information it requires at each step, it moves
through all states until it reaches the NAVI end node and the FSA has been
passed successfully.



[Figure: automaton whose transitions are labeled City, Street, Number]

Fig. 4.10. A finite state automaton to enter an address 



Single finite state automata can be combined on different levels. For
instance, there might be another FSA that describes a sub-dialogue on how to
ask for a city in multiple steps, asking for the name first, then for the
zip code so it can catch ambiguous entries, etc. When the city node is
activated, it submits the job to the city FSA, and once that is successfully
traversed, control returns to the initial automaton.

This is an easy and straightforward framework to gather the information 
needed to start an application. The initiative remains with the system, i.e. 
the machine asks questions, the user answers. This way - assuming the user 
is cooperative - the developer always has only a small, manageable number 
of possible responses to deal with and can therefore, at each point in the 
dialogue, apply speech recognition and understanding units that are fine- 
tuned to a particular question with a highly limited vocabulary and grammar, 
thereby increasing the robustness of the system. Most automated call-center 
dialogues are based on this framework. 
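
A system-driven dialogue of this kind can be sketched as a small transition
table. The following is only an illustration under stated assumptions: the
state names "NAVI init" and "NAVI 2", the prompts, and the function
run_dialogue are invented for this example, and the user answers are
supplied as a list instead of coming from a speech recognizer.

```python
# state -> (system prompt, slot filled by the answer, next state)
NAVI = {
    "NAVI init": ("Which city?", "city", "NAVI 1"),
    "NAVI 1": ("Which street?", "street", "NAVI 2"),
    "NAVI 2": ("Which number?", "number", "NAVI end"),
}

def run_dialogue(fsa, start, end, answers):
    """Walk through the automaton, consuming one user answer per state."""
    state, collected, prompts = start, {}, []
    answers = iter(answers)
    while state != end:
        prompt, slot, next_state = fsa[state]
        prompts.append(prompt)           # the system keeps the initiative
        collected[slot] = next(answers)  # the user answers the question
        state = next_state
    return collected, prompts

address, prompts = run_dialogue(
    NAVI, "NAVI init", "NAVI end", ["Springfield", "Main Street", "12"]
)
```

Because every state expects exactly one kind of answer, each state could
be paired with its own small recognition vocabulary and grammar, which is
the source of the robustness mentioned above.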

Frames and Slots. While the rigid dialogue flow of state-based systems makes 
them comfortably robust, it doesn’t allow the user much flexibility, since the 
discourse can only follow the predefined steps of the finite state automata. 
Rather than describing the dialogue flow and predetermining the sequence in
which the required data is gathered, it is also possible to just write down
the desired items and leave the order of their mentioning either to the user
or to the system, i.e. allowing an arbitrary sequence. The necessary
information is commonly stored in slots of a semantic frame. A frame usually
either collects all the data required to start an application (see for
instance the RoBoDiMa (Robert Bosch Dialogue Manager) toolkit [4.21]), or
holds semantic data of an utterance, system response, or discourse step, as
implemented in the Mercury flight reservation system [4.22]. The slots are
its atomic data vehicles; one can consider them attributes that assume a
value of a specific kind, see Fig. 4.11.

Accumulating the information this way allows a more flexible,
mixed-initiative discourse: both the machine and the user can initiate the
slot filling.

Frame 1
    Action        navigateTo
    DialogObject  location
        Slot city:    Springfield
        Slot street:  ?
        Slot St #:    ?

Fig. 4.11. An example frame from a navigation task. In an activity model of this 
kind, a frame is constructed from the combination of an action - here: navigateTo 
- and a physical object as its argument, in this case a location. Slot city has already 
been filled by user input, street and street number are empty 
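
A frame like the one in Fig. 4.11 maps naturally onto a small data
structure. The sketch below is one possible rendering, not the format of
any toolkit cited here; the class Frame, its methods, and the slot name
street_number are assumptions made for illustration. Slots may be filled in
any order, and the list of open slots tells the system what it may still
ask for.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    action: str
    dialog_object: str
    slots: dict = field(default_factory=dict)  # slot name -> value or None

    def fill(self, **values):
        """Fill any subset of slots, in whatever order they are mentioned."""
        for name, value in values.items():
            if name not in self.slots:
                raise KeyError(f"unknown slot: {name}")
            self.slots[name] = value

    def missing(self):
        """Names of the slots that are still empty."""
        return [name for name, value in self.slots.items() if value is None]

    def complete(self):
        return not self.missing()

frame = Frame("navigateTo", "location",
              {"city": None, "street": None, "street_number": None})
frame.fill(city="Springfield")        # state shown in Fig. 4.11
open_slots = frame.missing()          # the system may prompt for these
frame.fill(street_number="12", street="Main Street")  # user picks the order
```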



Domain Description - Ontology. One can also teach the machine what a 
domain looks like and then allow the capability to engage in a dialogue in 
this domain to evolve from its description of the world. To define a domain, 
one can look at the functions and the physical objects that are manipulated by 
these functions, then derive a set of actions that describe these functions, and 
objects that are the arguments of these actions [4.21], see Fig. 4.12. Objects 
are defined by their attributes, for example by slots that can be filled with 
semantic items such as a street name, the name of an airline, a time string, 
etc. This domain description needs to be linked 1:1 to the activity model, so 
that a filled frame precisely defines the application it starts (cf. [4.31]). 

A slot in turn can be defined by the type of the entity that fills it. It
can specify whether its value is required to start an action or is optional,
it can tell the response generation how to prompt the user for the value of
this slot, and so forth. Slots are often linked to a specific class of
entries in a system’s database, for example to the list of available
destination cities in an airline travel dialogue.
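
The properties of a slot listed above can be written down declaratively.
The following sketch assumes a hypothetical SlotSpec record; its field
names, the example slots, and the helper next_prompt are illustrative and
not taken from a specific dialogue manager.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotSpec:
    name: str
    entity_type: str       # type of the entity that may fill the slot
    required: bool = True  # must be filled before the action can start
    prompt: str = ""       # hint for the response generation
    allowed: tuple = ()    # optional link to a class of database entries

    def accepts(self, value):
        """An empty 'allowed' list means any value of the type is accepted."""
        return not self.allowed or value in self.allowed

DESTINATION = SlotSpec(
    name="destination",
    entity_type="city",
    prompt="To which city do you want to fly?",
    allowed=("Paris", "San Diego", "Cabo San Lucas"),
)
SEAT = SlotSpec(name="seat", entity_type="seat_preference", required=False)

def next_prompt(specs, filled):
    """Return the prompt of the first required slot that is still empty."""
    for spec in specs:
        if spec.required and spec.name not in filled:
            return spec.prompt
    return None
```

With such a description, a call like next_prompt([SEAT, DESTINATION], {})
yields the destination question, since the optional seat slot is skipped.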

Separation Domain Knowledge - General Discourse Competence. If done 
properly, this description of the world, the system’s ontology, is all the di- 
alogue manager needs to engage in a discourse about the given domain. 
Domain-specific knowledge is separated from the knowledge of how to reach 
a goal in conversation. The former is entered by a developer who designs an
application, i.e. adapts the dialogue system to a new domain; the latter is
the discourse faculty that is inherent in the system and need not be
touched. As long as you know how to negotiate, disambiguate, ask questions
and so forth, it doesn’t matter if the topic is airline reservation or
football; the ability to acquire and handle information in a conversation
remains the same.
This way, a system can be adapted to a new domain in a straightforward 
manner, without the need to actually describe each step in the dialogue flow. 

With a design of this kind, the user can say anything at any time and
is not ushered through a series of predefined steps. The domain definition
for a specific system is usually carried out by the application developer
with the help of dialogue builder GUIs, through XML, or with a custom
scripting language.

[Figure: action boxes goTo, getInfo, and reserve drawn above a graph of
dialogue objects; arrows in the legend denote the extends relation]

Fig. 4.12. The world description for a small navigation domain, creating a
semantic graph similar to the one in Fig. 4.5. Three basic functionalities,
drawn at the top of the figure as action boxes, were identified for this
domain. These work with the dialogue objects (DO, see [4.21]) displayed
below the actions. The base object is called location; it owns three slots.
Objects public transport stop and point of interest inherit all slots from
location, and in the POI case, add more of their own. Restaurant and hotel
are in turn derived from the point of interest DO. This inheritance is
expressed as the extends relation. The owns relation describes that one
entity is part of another one. The slot city, for example, is a property of
the location object, city in turn owns slot street, and street number is
part of street

Once the intention of a human speaker has been recognized, it is the 
discourse engine’s task to gather all the required information to start an ap- 
plication and generate a response. Rarely is the data acquired this way clear 
and unambiguous, and it is the discourse engine’s duty to find the proper 
information through turns of negotiation, list selection, disambiguation,
confirmation, possibly relaxation of constraints, and so forth. Consider a
user utterance “call John” in a voice-operated phone application and two
resulting speech recognition output hypotheses “call John” or “call Don.”
Or a flight reservation system that understands “Paris” but doesn’t know
whether the city is the starting point or the destination of a flight. (The
former is an example of a value ambiguity: two values, John vs. Don, are
candidates for the same semantic slot; the latter is a position ambiguity,
i.e. the system is unsure to which slot, start vs. destination city, the
item belongs.)

Frame Trees. One chief task in the data gathering process therefore is the 
representation and resolution of alternative semantic objects. Here, frame 
trees, or application trees (as documented in the Bell Labs Communicator 
[4.30]), prove a convenient method, see Fig. 4.13. This structure, based on 
decision trees borrowed from artificial intelligence, contains frame candidates
of an activity model: a completed path from root to leaf constitutes one frame,
i.e. one fully specified activity that can be passed on to the linked device.
In the example below, a frame consists of an activity, for instance, ‘give 
information’ and an object, e.g. a flight. Each time a conflicting candidate is 
created, a branch with the new hypothesis is attached to the tree. 



[Figure: frame tree with an action layer below the root and, beneath it, the nodes flight, date/time, airport, and destination; candidate values are San Diego and Cabo SL, open items are marked “?”]



Fig. 4.13. List of alternative slot items arranged in a frame tree consisting of an 
action layer (directly below the tree’s root) and the objects operated upon. The 
system has understood the user wants to start a dialogue about a flight from San 
Diego to Cabo San Lucas, or a trip in the opposite direction. It is unsure if the 
speaker wants to book a flight or merely wants information, therefore, at the root 
level, it needs to clarify whether it should go with the “book” or the “inform” 
action. In the “book” branch, the discourse engine needs a strategy to find out at 
what time and date the user wants to fly, which place is the departure airport, and 
which one is the arrival airport. The “inform” branch has two conflicting objects, 
either the user wants information about one of two possible airports or desires 
information about a flight. Note that the flights in the branches of the two different 
actions are actually the same one. Hence, if the discourse engine manipulates the 
data in one branch, it should do the same at the other end 
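
The idea of attaching a branch for every conflicting hypothesis and reading
off complete frames as root-to-leaf paths can be sketched as follows. This
is a simplified illustration and not the Bell Labs implementation: the
class Node and the function candidate_frames are assumptions, and the two
directions of the flight are collapsed into single leaf labels, whereas the
figure keeps departure and arrival as separate nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def add(self, label):
        """Attach a branch for a new (possibly conflicting) hypothesis."""
        child = Node(label)
        self.children.append(child)
        return child

def candidate_frames(node, path=()):
    """Each completed path from root to leaf is one candidate frame."""
    path = path + (node.label,)
    if not node.children:
        return [path]
    frames = []
    for child in node.children:
        frames.extend(candidate_frames(child, path))
    return frames

root = Node("root")
book = root.add("book")
inform = root.add("inform")
for action in (book, inform):          # the same flight under both actions
    flight = action.add("flight")
    flight.add("San Diego -> Cabo San Lucas")
    flight.add("Cabo San Lucas -> San Diego")
airport = inform.add("airport")        # conflicting object in "inform"
airport.add("San Diego")
airport.add("Cabo San Lucas")

frames = candidate_frames(root)
```

The discourse engine's task is then to prune this list, by asking the user
or by weighing evidence, until a single frame remains.
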

Statistical Framework. Of course, these candidates rarely have the same
likelihood. In the example of Fig. 4.13, if the language understanding module spots
a “from” directly before a city name, this city is more likely to be properly 
placed at the starting point node of the trip than at the arrival node. If an
airport name was understood with poor recognition confidence, it is considered
less probable; but if the user profile shows that the traveler has often
gone there in the past, this node’s probability score is raised again. If, in
subsequent dialogue turns, new evidence advocates one candidate, its score 
is raised and the value of conflicting alternatives consequently lowered. Once 
a complete frame/branch can be found and its score is sufficiently high, the 
frame can be accepted - with a measurable confidence. 
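
One simple way to realize this raising and lowering of scores is to
multiply a candidate's score by an evidence factor and renormalize. The
sketch below rests on assumptions: the factor 4.0, the threshold 0.9, and
the function names are arbitrary choices for illustration, not values from
any of the cited systems.

```python
def update(scores, candidate, factor):
    """Multiply one candidate's score by an evidence factor and renormalize,
    so that raising one hypothesis lowers its conflicting alternatives."""
    boosted = {c: s * (factor if c == candidate else 1.0)
               for c, s in scores.items()}
    total = sum(boosted.values())
    return {c: s / total for c, s in boosted.items()}

def accepted(scores, threshold=0.9):
    """Return the best candidate if its score is sufficiently high."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Two readings of "Paris": start or destination of the flight.
scores = {"start=Paris": 0.5, "destination=Paris": 0.5}
scores = update(scores, "start=Paris", 4.0)  # "from" heard before the city
first = accepted(scores)                     # 0.8 is not yet enough
scores = update(scores, "start=Paris", 4.0)  # supporting evidence, next turn
second = accepted(scores)                    # about 0.94, accepted
```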

If the system allows handling more than one completed frame, it gains 
the capability for multi-tasking and multi-threading, i.e. the capability to
handle multiple applications simultaneously and follow multiple lines of the
conversation, respectively. (See e.g. CSLI’s WITAS dialogue system [4.31]
designed to control a robotic helicopter equipped with vision and planning 
capabilities. The - grounded - speaker communicates with the vehicle to 
specify instructions and goals, monitor the progress of multiple current and 
future activities, and solve problems jointly with the helicopter’s planning 
engine.) 

Integrating probabilities, confidence measures, and likelihoods in a
statistical framework for the whole system is no small effort, but a
rewarding one. Every time a dialogue decision has to be made, it can be
based on a statistically sound, real-valued figure rather than on ad-hoc,
binary rules. If the discourse engine is interested in filling a particular
slot and gets a very high probability measure based on acoustic and language
model confidence, user profile, and context match, the slot can be filled
with the candidate value straight away. If it is less certain, it is
advisable to employ confirmation cycles, and the dialogue manager may decide
to discard incoming semantic data that is too questionable and ask the user
to repeat the utterance.
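
The three-way decision just described amounts to comparing a combined
confidence figure against two thresholds. In the sketch below, the plain
product used to combine the individual scores (which assumes they are
independent) and the threshold values 0.8 and 0.3 are assumptions for
illustration only.

```python
def combined_confidence(acoustic, language_model, profile, context):
    """Naive combination: product of scores in [0, 1], assumed independent."""
    return acoustic * language_model * profile * context

def decide(confidence, accept_at=0.8, reject_below=0.3):
    """Map a real-valued confidence to one of three dialogue moves."""
    if confidence >= accept_at:
        return "fill slot"
    if confidence >= reject_below:
        return "confirm"
    return "ask to repeat"

move = decide(combined_confidence(0.95, 0.95, 1.0, 0.95))
```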

Or consider a system that usually works elegantly and allows mixed
initiative, i.e. both the user and the machine can initiate a discourse
turn, for example, enter semantic data or ask questions. These applications
are more complex than a machine-prompted dialogue and hence deteriorate more
easily as the speech recognition performance gets worse in a challenging
acoustic environment. If the discourse engine has an appropriate performance
measure, it can sense that the dialogue doesn’t run smoothly anymore and
switch back from mixed-initiative discourse to a machine-guided
conversation* (cf. for instance the LIMSI ARISE system [4.32]), e.g. asking
questions like “To which city do you want to fly?” This way, it is more
likely that the user answers in context and possibly even reuses the words
and phrase structure he has heard in the system prompt when he formulates
his response: “Paris” or “I want to fly to Paris.”

* The proper distribution of initiative also depends on the user. Expert
users work well with a mixed-initiative dialogue. This way, they have more
freedom in how to phrase their requests and can therefore use the system
more efficiently. Naive users can get overwhelmed by the machine’s
complexity and can be unsure what options they have at a given point in the
dialogue. In this case, switching to a guided dialogue avoids frustration.
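
As one possible performance measure, the discourse engine could track a
moving average of recent recognition confidences. The class below is a
sketch under that assumption; the window of three turns and the threshold
of 0.6 are arbitrary, and the cited systems may use quite different
measures.

```python
from collections import deque

class InitiativeMonitor:
    """Tracks recent recognition confidences and picks a dialogue mode."""

    def __init__(self, window=3, threshold=0.6):
        self.recent = deque(maxlen=window)  # keeps only the latest turns
        self.threshold = threshold

    def observe(self, confidence):
        """Record the confidence of one turn and return the current mode."""
        self.recent.append(confidence)
        return self.mode()

    def mode(self):
        if not self.recent:
            return "mixed-initiative"
        average = sum(self.recent) / len(self.recent)
        if average >= self.threshold:
            return "mixed-initiative"
        return "system-guided"

monitor = InitiativeMonitor()
modes = [monitor.observe(c) for c in (0.9, 0.8, 0.3, 0.2, 0.4)]
```

In this run the first three turns stay in mixed-initiative mode, and the
dialogue falls back to system-guided prompts once the average over the last
three turns drops below the threshold.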

4.2.6 Response Generation 

The last modules in the dialogue process are responsible for generating a 
sentence and vocalizing it to the user. This incorporates three stages: 

- concept generation - create the idea of a sentence 

- text generation - produce a natural language string of words 

- speech synthesis - generate the speech signal. 



Concept and Text Generation. At the first step, the system recognizes the 
need to address the user. This message idea can arise in any unit of the system; 
the speech recognition might find the acoustic confidence too low and ask the
user to get closer to the microphone, the natural language understanding part 
might not understand a reference and ask for clarification, and, chiefly, the 
discourse engine will initiate utterances, say, to inquire the value of a slot, 
answer a question, or inform the user of an event. 

At the end of the concept generation stage, the modules have produced a 
logical description of “what to say.” Similar to NLU output, the final
semantic/pragmatic representation can be expressed either as a frame, in
quasi-logical form, or in a system-specific format. In small systems that
produce only a limited range of short utterances, it is possible to pass on
only the key to a template - a pattern resembling a sentence outline with
gaps or placeholders to insert words from the specific context - written in
a text generation codebook, and the items with which to fill the template.

Once a logical representation has been created, this idea of an utterance 
can be turned into natural language, that is, text. Ideally, this works like a 
pass through the system’s natural language processing unit carried out in the 
opposite direction.*



* In this manner, it can also be used for translation: the text is analyzed
by the NLU and recorded in a language-independent semantic/pragmatic
notation like the quasi-logical form. This is used as the utterance concept
for a text generation module working in another language, producing text
output in that language.

As is the case with natural language processing, translation can also be
approached from a statistical position. (See the Verbmobil documentation
[4.17] for a project using several translation methods.) The process can be
understood as finding the word sequence $\hat{T}_{1,J}$ out of all possible
sentences $T_{1,J} = t_1, \ldots, t_j, \ldots, t_J$ in the target language
that best represents the sentence $O_{1,I} = o_1, \ldots, o_i, \ldots, o_I$
(in the original language), i.e.
$\hat{T}_{1,J} = \arg\max_{T_{1,J}} P(T_{1,J} \mid O_{1,I})$, and, using
Bayes:

Suppose the system is a small hotel reservation application for the three
hotels in Smallville, and the response generation uses codebook lookup. When
asked “Is there a room at the Palace Hotel?”, it could return a predefined
answer concept after completing the vacancy check, e.g. something like

response #5 ARGUMENT Palace Hotel 

which the text generation can look up in its list of response templates. It
pulls out number five, inserts “Palace Hotel,” and presents a string like
“Yes, there are rooms available at the Palace Hotel.” If ‘number five’
contained not only one string but a list of similar utterances with the same
meaning from which the response generation can choose (so that the user does
not always hear the same answer and get annoyed by the repetition), even a
small system like this one would exhibit a fairly natural feel.
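
The codebook lookup described above fits in a few lines. The codebook
contents, the second paraphrase, response number 6, and the function
generate are invented for this sketch; only response #5 with the argument
“Palace Hotel” comes from the example in the text.

```python
import random

# Hypothetical codebook: response number -> paraphrases with one placeholder.
CODEBOOK = {
    5: [
        "Yes, there are rooms available at the {arg}.",
        "Yes, the {arg} has vacancies.",
    ],
    6: ["Sorry, the {arg} is fully booked."],
}

def generate(response_id, argument, rng=random):
    """Pick one of the equivalent templates and insert the argument."""
    template = rng.choice(CODEBOOK[response_id])
    return template.format(arg=argument)

# response #5 ARGUMENT Palace Hotel
text = generate(5, "Palace Hotel", random.Random(0))
```

Choosing at random among paraphrases is what keeps the user from hearing
the identical sentence every time.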

Needless to say, once a domain gets more complex, this approach is un- 
feasible; the number of all possible utterances is just too high. In this case, a 
frame-based representation can help: 

SpeechAct: Yes/No Answer “YES” 

Action: checkVacancy 

Object: Hotel 

Name: Palace Hotel 

To generate the corresponding text, the response generation employs the 
syntactic structure it has in store for the Yes/No answer speech act. The 
action and object specific components of the sentence are made available to 
the system in the developer-entered domain description. 
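
A frame-based generator separates the sentence pattern of the speech act
from the domain-specific wording. The sketch below illustrates that split
under stated assumptions: the two lookup tables, their contents, and the
function realize are made up for this example and do not reproduce the
format of any particular system.

```python
# Syntactic patterns stored per speech act (general discourse competence).
SPEECH_ACT_SYNTAX = {
    ("YesNoAnswer", "YES"): "Yes, {clause}.",
    ("YesNoAnswer", "NO"): "No, {clause}.",
}
# Action- and object-specific phrases from the domain description.
DOMAIN_PHRASES = {
    ("checkVacancy", "Hotel", "YES"): "there are rooms available at the {Name}",
    ("checkVacancy", "Hotel", "NO"): "there are no rooms available at the {Name}",
}

def realize(frame):
    """Combine the speech act pattern with the domain phrase, then fill slots."""
    act, polarity = frame["SpeechAct"]
    clause = DOMAIN_PHRASES[(frame["Action"], frame["Object"], polarity)]
    sentence = SPEECH_ACT_SYNTAX[(act, polarity)].format(clause=clause)
    return sentence.format(**frame["Slots"])

frame = {
    "SpeechAct": ("YesNoAnswer", "YES"),
    "Action": "checkVacancy",
    "Object": "Hotel",
    "Slots": {"Name": "Palace Hotel"},
}
sentence = realize(frame)
```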

The most general approach is a description that encodes the utterance
concept in a powerful, domain-independent semantic description. Again, the
logical form is an apt representation:

(YESNO Answer Yes (<Present EXIST> e1
    [THEME Quantity:1+ (<ROOM r1>)]
    [AT-LOC <THE PALACE HOTEL h1>]))

* (continued)
$\hat{T}_{1,J} = \arg\max_{T_{1,J}} P(O_{1,I} \mid T_{1,J})\, P(T_{1,J})$,
where $P(T_{1,J})$ is a language model of the target language (e.g. a
trigram model), and $P(O_{1,I} \mid T_{1,J})$ is the translation model. To
take care of different word order in different languages, we introduce an
alignment model $A_{1,I} = a_1, \ldots, a_i, \ldots, a_I$ that maps position
$i$ of source word $o_i$ to target word $t_j$’s position $j = a_i$. A
fertility model $F_{1,I}$ likewise decides how many words in the target
language are produced by each word $o_i$ in the original sentence. $A_{1,I}$
and $F_{1,I}$ are included in the translation model, resulting in
$\hat{T}_{1,J} = \arg\max_{T_{1,J}} P(O_{1,I}, A_{1,I}, F_{1,I} \mid T_{1,J})\, P(T_{1,J})$
(cf. the original work carried out at IBM [4.33] and the algorithms put to
use in the Verbmobil translation project [4.17]). The parameters are trained
on parallel bilingual corpora such as the Hansards, the English/French
transcriptions of parliamentary debates in Canada.

Just like in a parser, it can be convenient to use multiple layers to go from
a semantic/pragmatic encoding like a logical form to intermediate notations,
e.g. describing types of phrases, and ultimately, to output a word string.

4.2.7 Speech Synthesis 

Finally, at the output module, the generated text is converted to speech. Two 
alternative techniques serve to accomplish this feat, one based on the idea of 
modeling the process of articulation, the other one constructing a signal by 
appending sections of recorded speech samples. 

Model-Based Synthesis. Before the actual output signal can be computed, 
the word string is first translated to its phonemic representation. Usually, 
the text-to-speech translator (TTS) contains a lexicon in which it can look 
up the proper transcriptions. Another option is to generate phonemes on the 
fly for every utterance using a set of pronunciation rules.*

Often, systems combine both methods: lexicon lookup, employed for words 
that are known to the system beforehand, and transcription rules for entries 
such as proper names that are added by the user at a later time, for instance 
the names in a phone directory. 
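
The combination of lexicon lookup and rule-based fallback can be sketched as
follows. Everything in this example is a toy assumption: the lexicon
entries, the informal phoneme symbols, and the three letter-to-sound rules
are invented, and a rule set for a real language would be far larger.

```python
# Toy lexicon for words known beforehand (informal phoneme symbols).
LEXICON = {"call": ["k", "ao", "l"], "john": ["jh", "aa", "n"]}
# Toy letter-to-sound rules, tried in order at each position.
RULES = [("sch", ["sh"]), ("oe", ["er"]), ("qu", ["k", "w"])]

def by_rule(word):
    """Transcribe a word that is not in the lexicon, left to right."""
    phonemes, i = [], 0
    while i < len(word):
        for letters, sounds in RULES:
            if word.startswith(letters, i):
                phonemes.extend(sounds)
                i += len(letters)
                break
        else:  # no rule matched: map the single letter to itself
            phonemes.append(word[i])
            i += 1
    return phonemes

def transcribe(word):
    """Lexicon lookup first, transcription rules as a fallback."""
    word = word.lower()
    return LEXICON.get(word) or by_rule(word)
```

A known word such as “John” is taken from the lexicon, while a name added
later, say “Quast”, falls through to the rules.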

These transcription rules need not be explicitly entered into the TTS.
Sejnowski and Rosenberg, in 1987, trained a neural network on written text
and its corresponding phonemic output as provided by the DECtalk expert
system. The network, which they called NETtalk [4.35], captured the
behavior nicely, and curiously, while learning to speak, it sounded much
like a babbling infant. So, while most systems are rule-based (expert
systems), saving time by setting up a learning-by-doing pattern recognition
engine is an option - at least for French!*

Once the phonemic symbols are prepared, the utterance can be vocalized.
In the model-based approach, synthesis is carried out as a functional,
generative process, mimicking human speech production to a more or less
detailed degree. Articulatory models attempt to capture the physical
properties of the vocal tract and its organs as speech is produced, creating
the output sound as a source signal - oscillation or noise, for
voiced/unvoiced speech, respectively - filtered with the impulse response of
the closely modeled vocal tract. More abstract systems do not necessarily
attempt a one-to-one representation of the involved physiology and its
mechanics; rather, they try to capture its behavior on a higher level.

* The number of transcription rules necessary to generate a phonemic
transcription from text varies greatly from language to language. A
well-behaved tongue like Spanish requires fewer than 100 such rules - to
cover French pronunciation, it takes more than 500 [4.34]. To beef up the
challenge some more for the foreign language student, the text-to-phone map
in French is often surjective: many endings, which, written down, allow one
to grasp a word’s morphology and hopefully its sense, end up in the same or
very similar phoneme bins; just consider the nasal endings -ant, -ent, -ont,
-and, -ans, -amps, -an, ... If this is bad for speech synthesis, it’s even
worse for speech recognition!

With text-to-phoneme rules on the order of 100 and, owing to coarticulation
(i.e. the change in phoneme articulation caused by neighboring sounds), some
1000-30000 rules on how to articulate and produce these sounds, the memory
requirements of a model-based text-to-speech system are quite manageable,
making it the method of choice for small applications that need to cover
a large vocabulary, especially on portable systems.

Synthesis by Concatenation. Better sounding speech can be produced by con- 
catenation of stored and labeled speech samples, usually though at the cost 
of a bigger memory footprint. Going back to the text generation examples, 
in a small dialogue application with a fixed set of responses it is feasible and 
useful to record all of these responses once and thus obtain the most natural 
output quality possible. The next step towards a more flexible vocabulary is
the template-based synthesis utilized by many telephone dialogue systems, 
for example a finite state airline lost baggage report application: 

- System: “Enter your arrival airport.” 

- User: “St. Thomas.” 

- System: “I think you said St. Thomas. Is that correct?”

Here, the added degree of freedom is observed in a definite location of the 
response utterance - at the airport name - while the biggest part of the 
answer remains fixed and can therefore be stored as one sample. The variable 
constituents are known at the time of dialogue development and are thus also 
recorded rather than generated later on. 

In order to be able to produce all possible words, elementary building 
blocks are selected, smoothed at their edges and attached: phones, diphones, 
triphones, or pseudo-syllabic segments made up of a variable number of 
phones. 

Even though triphone-based concatenation synthesizers produce good- 
quality speech, they do not quite match the level that can be reached by 
sampling larger segments. This is the reason why one rarely encounters a 
mixture of triphone/long segment synthesis: the break in speaking style when 
switching from one mode to another one just sounds too awkward. Our ears 
seem to prefer consistency. Or rather: it’s the changes in our environment 
that get our attention, not the steady states. 

Prosody. The naturalness of synthetic speech depends to a large extent on
the quality of its generated prosody, that is, pitch, loudness, and rhythm.
The overall intonation contour can be computed as a function of the
utterance’s structure. But besides that, our speaking style also depends on
the emotional and emotive content of a message, the dialogue partner, and
the environment. If the person we talk to is happy or sad, talks in a low
voice or is hectic, we usually react to this or even imitate the manner in a
process called internal simulation. If we address someone in a noisy place,
we raise our voice, adjust vowel length and timbre - the Lombard effect.
Additionally, accentuated words experience a deviation from the regular
pitch, loudness, and duration values.

In a speech synthesizer, the stress marking such accents is placed on the 
corresponding syllables depending on a word’s syntactical function and the 
semantic focus of a sentence. Unlike a general text-to-speech program
that is applied, for instance, to email or fax reading, the TTS in a
dialogue system has the benefit of knowing both the syntactical structure 
and the semantic spotlights of a sentence - after all, it was created by the 
same machine - and therefore can compute rather than guess a comfortably 
natural prosody. This knowledge is a big advantage; while many different 
prosody contours are valid and sound authentic for a given sentence, other 
interpretations can appear definitely incorrect. 

In rule-based speech synthesis, prosodic parameters can be included in the 
mathematical model and the necessary phones created as desired. To achieve 
the proper output quality in the concatenation methodology, it can be neces- 
sary to store a number of samples of the same phone as observed at different 
pitch values (possibly also with varying loudness and duration). Even so, at 
the edges of the phonetic units, the fundamental frequencies of two segments 
will rarely match exactly. To achieve a smooth concatenation of the signal
building blocks, time-domain Pitch Synchronous Overlap-Add (PSOLA)
synthesizers analyze the fundamental frequency of a segment so that they can
add or delete fundamental periods to adjust the segment duration, and adjust
the pitch by deleting or inserting samples in a fundamental period.
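
The duration adjustment alone, i.e. repeating or dropping whole fundamental
periods, can be illustrated with a toy example. This is deliberately not a
full PSOLA implementation: it assumes a constant, known period length and
omits the windowing and overlap-add smoothing as well as the pitch
modification; all names are made up for the sketch.

```python
def split_periods(signal, period):
    """Cut a voiced segment into consecutive fundamental periods."""
    last_start = len(signal) - period
    return [signal[i:i + period] for i in range(0, last_start + 1, period)]

def change_duration(signal, period, factor):
    """Repeat or drop whole periods so that the segment has about 'factor'
    times its original length; the period length (pitch) is unchanged."""
    periods = split_periods(signal, period)
    target = max(1, round(len(periods) * factor))
    out = []
    for k in range(target):
        # Map each output period back to an input period.
        source = min(len(periods) - 1, int(k / factor))
        out.extend(periods[source])
    return out

period = 4
segment = [0, 1, 0, -1] * 5   # five periods of a toy waveform
longer = change_duration(segment, period, 1.6)   # eight periods
shorter = change_duration(segment, period, 0.6)  # three periods
```

Because only complete periods are inserted or removed, the fundamental
frequency of the segment is preserved while its length changes.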

4.2.8 Summary 

The past has seen steady progress in digital speech processing, from a time
when a trunkful of punchcards was necessary to store a few seconds of
digitized speech to the present, when speech processing software can easily
be programmed on every PC, and applications are available for devices as
small as palmtop computers, cell phones, or watches. Major technological
advances in language processing systems have been fostered under the
auspices of huge projects such as the DARPA Communicator in the U.S. -
providing a framework for prominent programs like Rochester’s TRAINS and
TRIPS, MIT’s Mercury, BBN’s Talk ’n’ Travel, the Bell Labs, CMU, and CU
Communicators - and the German Verbmobil, an automatic translation project
involving hundreds of researchers from over 30 consortial partners in
academia and industry.

The ultimate benchmark for a technology that is geared toward human- 
machine interaction, however, is user appreciation, not scientific brilliance. 
Here, two worlds collide: the expert who - rightfully so - gets excited as 
he explains the ingenious methodology that drives his dialogue system, and 
the ‘naive’ user who doesn’t care about algorithms but wants to know when
he’ll be able to drive the talking car that’s been on TV for years. Often it
is hard to convey the complexity and tremendous work required to build a
functional system even for rather limited domains such as the translation
scenarios of Verbmobil and the travel planning applications of the DARPA
Communicator, whose performance still has a way to go to achieve
human-like levels.

Automatic speech recognition has been researched with huge effort for
decades, and progress has somewhat stalled. Considerable work is done to
squeeze just a few percent decrease in word error rate out of the hidden
Markov model setup. Quite obviously, preparing the proper input word string
can make or break a speech dialogue system, no matter how good the
discourse strategy. When the participants of a conference on automatic
speech recognition and understanding in 2003 were asked at what point in
time ASR would achieve human-like performance, the largest number of
responses pointed to the year 2020; the average voter was more pessimistic,
suggesting 2064 [4.36]. In the meantime, it will be thrilling to find out
whether progress is achieved through persistent work on the current
techniques such as HMMs and trigrams, or whether there will be a turn to new
concepts. If everything else should fail, there’s still Moore’s law to the
rescue: every 18 months, we can expect to find twice as many transistors on
a chip as one and a half years earlier, and many applications will become
better simply because more computational force can be applied.

The ultimate ambition of research in the field of dialogue systems is to
build a Star Trek application that can converse intelligently about any given
topic. Most likely, that goal lies quite a bit further away in the future than
speech recognition with human-like accuracy. Prerequisites for such a machine 
are automatic knowledge acquisition, representation, and reasoning, so that 
systems do not need to be manually hacked to tackle a few discourse domains 
but are able to grasp larger pieces of the world - be it through human-guided 
training or autonomous learning. 

While these capabilities are largely science fiction, the work on speech 
dialogue systems, at the focus of applied linguistics and signal processing, 
is a very enjoyable field. After all, talking cars and spaceships are the stuff 
movies are made of, and to a limited extent, applications start to appear 
indeed now in these very domains, promising this workplace will grow even 
more fascinating in the future.

5. Speech Compression

[British Prime Minister Ramsay] MacDonald has the gift of compressing the
largest amount of words into the smallest amount of thought. 

Winston Churchill (1874-1965)



Speeches in our culture are the vacuum that fills a vacuum.

John Kenneth Galbraith (born 1908)



Speech compression, once an esoteric preoccupation of a few speech enthu- 
siasts, has taken on a practical significance of singular proportion. As men- 
tioned before, it all began in 1928 when Homer Dudley, an engineer at Bell 
Laboratories, had a brilliant idea for compressing a speech signal with a 
bandwidth of over 3000 Hz into the 100-Hz bandwidth of a new transat- 
lantic telegraph cable. Instead of sending the speech signal itself, he thought 
it would suffice to transmit a description of the signal to the far end. This 
basic idea of substituting for the signal a sufficient specification from which 
it could be recreated is still with us in the latest linear prediction standards 
and other methods of speech compression for mobile phones, secure digital 
voice channels, compressed-speech storage for multimedia applications, and, 
last but not least, Internet telephony and broadcasting via the World Wide 
Web. 

The ultimate speech compression could be achieved by first recognizing 
speech and then resynthesizing a speech signal from the recognized text. 
Given that the entropy of written English, according to Shannon, is about 
2.3 bits per letter and assuming a speaking rate equivalent to 10 letters per 
second, speech - without intonation and other personal characteristics - can 
be compressed into some 23 bits per second. Whether anyone would like to 
listen to the output from such a scheme is of course doubtful. But there
may well be situations where this ultimate compression is the only option for
speech transmission. 



5.1 Vocoders 

Dudley’s original idea was to transmit information about the motions of the 
speaker’s articulatory organs, the tongue, the lips, and so forth. When this 
turned out to be impossible (in fact, it is still difficult to extract these param- 
eters from a running speech signal), Dudley suggested sending a succession 
of short-time spectra instead. This led to the channel vocoder in which each 
speech spectrum is described by its smooth spectral envelope and, in the 
case of voiced sounds, the spectral fine structure, that is, the spacing of the
harmonics of the fundamental frequency or voice “pitch” [5.1]. 

As described previously, the first important application of the vocoder 
occurred in World War II when it was used to encrypt the telephone link 
between Churchill in London and Roosevelt in Washington. The compression 
made possible by the vocoder permitted the speech signal to be encoded by 
as few as 1551 bits per second, a bitrate that fitted into existing transatlantic 
radio channels [5.2]. 

The first X-System, as it was called, was operated on 1 April 1943, just
seven months after this complex project had been launched. This short time 
span is all the more astonishing considering that much of the circuitry was 
completely new. One of the ingredients of this lightning success - always 
crucial in a national emergency - was the decision to assemble the system, as 
far as possible, from existing, off-the-shelf Western Electric hardware even if 
it was not always ideal. 

The X-System occupied 30 seven-foot relay racks and consumed 30 kW of 
electrical power (to produce, as the sarcastic saying went, “one milliwatt of 
poor-quality speech.”). After Washington and London, X-Systems were in- 
stalled in North Africa, Paris, Hawaii, Australia and the Philippines [5.3]. The 
design of the X-System gave a considerable impetus to pulse code modulation,
or PCM [5.4]. It also stimulated renewed thinking about secrecy systems
in general, especially on the part of Claude Shannon, who had returned to 
Bell Labs in 1941 from MIT and Princeton and who served on several com- 
mittees dealing with cryptanalysis [5.5]. In particular Shannon was asked to 
take a close look at the modular arithmetic (then called “reentry process”) 
used in the encryption to make sure that nothing had been overlooked in the 
assumption that the key was unbreakable. 

It is interesting to note that Shannon’s subsequent paper on “information 
theory” (as it has been misleadingly called) was closely tied to his work on 
secrecy systems [5.6]. 

The vocoder work for civilian applications was resumed after the war, 
but it had to start almost from scratch because the progress made during the
war, along with numerous patents, was classified “top secret” and kept under
wraps for decades [5.7]. For general telephony, the principal difficulty was the
“pitch problem,” the need to track the exact fundamental frequency in real 
time. To make matters worse, the fundamental frequency of most telephone 
speech is actually missing in telephone lines that do not transmit frequencies 
below 200 or 300 Hz. 

When the author joined these efforts at Bell Telephone Laboratories in 
1954, the most promising approach to the pitch problem was autocorrelation 
analysis. Of course, for a steady, noise-free voiced speech sound, the first 
maximum of the autocorrelation function (at nonzero delay) occurs at a delay 
corresponding to an integral pitch period. But, unfortunately, speech signals 
are not stationary and the highest maximum in the delay region of interest 
often corresponds to a delay of one pitch period plus or minus one formant 
period. A vocoder driven from such a pitch signal sounds rather queer - 
drunk or “tipsy,” to be more precise. To overcome these difficulties dozens of 
schemes were floated and tried - and found wanting for various reasons [5.8]. 
The pitch problem was finally laid to rest with the invention of cepstrum 
pitch detectors [5.9]. The cepstrum is defined as the Fourier transform of the 
logarithm of the power spectrum, see Chap. 10. 
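
The cepstrum definition just given can be tried out in a few lines. The following is a minimal sketch, not the detector of [5.9]: it assumes a single frame of 512 samples at 8 kHz, uses a small pure-Python FFT (the helper `fft` is written here only to keep the example self-contained), and searches the 3 to 20 ms range for the largest cepstral peak. The test frame, a decaying pulse train exciting one resonance near 700 Hz, is hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def cepstrum(frame):
    """Fourier transform of the logarithm of the power spectrum."""
    n = len(frame)
    # A tiny floor avoids log(0) for exactly empty spectral bins.
    log_power = [math.log(abs(v) ** 2 + 1e-12) for v in fft(frame)]
    # The log power spectrum of a real frame is real and even, so the
    # forward transform scaled by 1/n equals the inverse transform.
    return [v.real / n for v in fft(log_power)]


def pitch_period(frame, fs, t_min=0.003, t_max=0.020):
    """Lag (in samples) of the largest cepstral peak in the pitch range."""
    c = cepstrum(frame)
    lo, hi = int(t_min * fs), int(t_max * fs)
    return max(range(lo, hi + 1), key=lambda q: c[q])


# Hypothetical test frame: pulses every 64 samples (8 ms), decaying in
# amplitude, filtered by one resonance near 700 Hz.
fs, period, n = 8000, 64, 512
rho, theta = 0.9, 2 * math.pi * 700 / fs
frame, y1, y2 = [], 0.0, 0.0
for i in range(n):
    pulse = 0.9 ** (i // period) if i % period == 0 else 0.0
    y = pulse + 2 * rho * math.cos(theta) * y1 - rho * rho * y2
    frame.append(y)
    y1, y2 = y, y1

lag = pitch_period(frame, fs)
f0 = fs / lag  # estimated fundamental frequency in Hz
```

The smooth spectral envelope contributes only at short lags, while the harmonic fine structure produces a peak at the pitch period, which is why the search is confined to the pitch range.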

However, even the cepstrum had problems with tracking the pitch of two 
different voices on the same line. For such cases, a better solution than the 
cepstrum turned out to be the Fourier transform of the magnitude of the 
Fourier transform, in other words, replacing the logarithm of the power spec- 
trum (as in the cepstrum) by the square root of the power spectrum [5.10]. 

Another method that sometimes outperformed the cepstrum is the “har- 
monic product spectrum,” in which each observed harmonic frequency is 
considered, in a probabilistic manner, an integer multiple of the fundamental 
frequency [5.11]. 

The frequency-channel vocoder was soon joined by numerous other parametric
compression schemes such as formant vocoders [5.12], harmonic compressors
[5.13], correlation vocoders [5.14], and phase vocoders [5.15]; see
also [5.16]. For a review of speech analysis and synthesis by vocoders, see
J. L. Flanagan’s comprehensive survey [5.17].



5.2 Digital Simulation 

Before the advent of digital simulation in the late 1950s, new ideas for speech 
processing had to be tried out by building analog devices. More often than 
not, failure was attributed not to the idea per se but to a flawed implementa- 
tion. All this was changed drastically by digital simulation [5.18], facilitated 
by the BLODI (for BLOck Diagram) compiler of J. L. Kelly, V. A. Vyssotsky 
and Carol Lochbaum [5.19]. 

One of the first applications of digital simulation was made by M. V. 
Mathews, namely a method for audio waveform compression called extremal 
coding [5.20]. In extremal coding only the positions and amplitudes of the
maxima and minima of the waveform are maintained, while intermediate
values are approximated by a cubic “spline function.” 
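
The idea can be illustrated by a short sketch. It is not Mathews' original procedure [5.20]; it assumes that the two end points of the record are kept along with the extrema, and that each pair of neighboring extrema is joined by a cubic segment with zero slope at both ends, which is one simple choice of cubic interpolation. The function names are hypothetical.

```python
import math


def extremal_encode(x):
    """Keep only the end points and the local maxima and minima of x."""
    idx = [0]
    for n in range(1, len(x) - 1):
        is_max = x[n] > x[n - 1] and x[n] >= x[n + 1]
        is_min = x[n] < x[n - 1] and x[n] <= x[n + 1]
        if is_max or is_min:
            idx.append(n)
    idx.append(len(x) - 1)
    return [(n, x[n]) for n in idx]


def extremal_decode(knots, length):
    """Join successive knots by cubic segments with zero slope at both ends."""
    y = [0.0] * length
    for (n0, v0), (n1, v1) in zip(knots, knots[1:]):
        for n in range(n0, n1 + 1):
            t = (n - n0) / (n1 - n0)
            y[n] = v0 + (v1 - v0) * (3 * t * t - 2 * t ** 3)
    return y


# Test waveform (an assumption): four periods of a sinusoid, 32 samples each.
x = [math.sin(2 * math.pi * n / 32) for n in range(128)]
knots = extremal_encode(x)
y = extremal_decode(knots, len(x))

# Error between the first and last extremum, where both segment ends are
# true extrema of the waveform.
interior_error = max(abs(a - b) for a, b in zip(x[8:121], y[8:121]))
```

For this sinusoid, 10 knots stand in for 128 samples, and the interior error stays at about two percent of the amplitude. The first and last segments fit less well, because the record ends are not extrema.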

This was one of the first successful digital simulations of a signal proces- 
sor on a general purpose computer at Bell Labs. Digital simulation had been 
transplanted from MIT to Bell Labs by Mathews in 1955, when he joined 
Bell’s acoustics research. Extremal coding was tailor-made for digital simu- 
lation because computer-running times were quite reasonable. (The author 
later took digital simulation to “unreasonable” lengths by simulating signal 
processors containing hundreds of sharp bandpass filters [5.13] and simulating 
sound transmission and reverberation in full-size concert halls [5.21].) 

Some of the processing was done on the largest extant IBM computer at 
their New York City headquarters at Madison Avenue and 57th Street. A few 
seconds of speech typically filled a large car trunk with punched cards. At 
that moment in the history of signal processing, digital simulation involved 
risking parking tickets. 



5.3 Linear Prediction 

Linear prediction, especially in the form of code-excited linear prediction
(CELP), has become the method of choice for speech and even audio
compression.

The mathematics of linear prediction goes back to Carl Friedrich Gauss 
(1777-1855) and Norbert Wiener (1894-1964). Gauss correctly predicted the 
reappearance of the asteroid Ceres after it had been lost in the glare of the 
sun and he gained worldwide fame for this feat. Wiener was instrumental in
marshalling prediction for anti-aircraft fire-control during World War II. 

Linear prediction came to the fore in speech research in 1966. After 12 
years of work on vocoders, the author had become somewhat impatient with 
their unsatisfactory speech quality. The idea was to encode speech signals 
not in a rigid vocoder-like fashion but to leave room for ‘error’ in the coding 
process. Specifically each new speech sample was “predicted” by a weighted 
linear combination of, typically, 8 to 12 immediately preceding samples. The 
weights were determined to minimize the r.m.s. error of the prediction. 

Thus was born linear predictive coding (LPC) for speech signals with a 
prediction residual or “error” signal to take up the slack from the prediction. 
Since speech is a highly variable signal, B. S. Atal and the author opted for 
an adaptive predictor [5.22]. 

There are in fact two kinds of major redundancies in a voiced speech sig- 
nal: from the formant structure (decaying resonances of the vocal tract) and 
the quasiperiodicity (“pitch”) of the vocal cord oscillations. Thus, our
adaptive predictor consisted of two parts: a short-time (ca. 1 ms) predictor for the
formant structure or spectral envelope and a long-time (3 to 20 ms) predictor 
for the pitch of voiced speech. We chose 8 predictor coefficients for the
short-time predictor: two each for the three formants in the telephone band and
two more to take care of the glottal pulse and the frequency dependence of
the radiation from the lips. With only one bit per sample allotted to the
prediction residual (but no coding of the slowly varying predictor coefficients) a
speech quality indistinguishable from the input was achieved [5.23].

Having taken up research in hearing in the 1950s (with the aim of design- 
ing better-sounding speech coders), the author proposed replacing the r.m.s. 
error criterion in linear prediction by a subjective measure, namely the per- 
ceived loudness of the quantizing noise. Given a proper spectral shape, the 
quantizing noise becomes less audible or is even completely masked by the 
speech signal itself. Beginning in 1972, J. L. Hall, Jr., and the author mea- 
sured the masking of noise by signals (as opposed to the customary masking 
of signals by noise) [5.24]. The result was linear prediction with a perceptual 
error criterion [5.25]. 

Methods of perceptual audio coding (PAC) have now found wide application
in speech, music, and general audio coding. Together with an excitation
signal derived from a code book (code-excited linear predictor or CELP), bit 
rates for the prediction residual of 1/4 bit per sample for high-quality speech 
were realized by Atal and the author in the early 1980s [5.26]. For audio 
compression, rates as low as 16 kilobits per second were demonstrated by D. 
Sinha, J. D. Johnston, S. Dorward and S. R. Quackenbush. Near compact-disc
(CD) quality was achieved at 64 kbps [5.27]! 

5.3.1 Linear Prediction and Resonances 

The fact that a single resonance is characterized by just two parameters 
manifests itself in another way - in a manner that is much easier to exploit. 
If we turn our attention from the frequency domain, we can see that for a 
single, unperturbed, freely decaying resonance (after excitation has ceased) 
just three successive samples determine the entire future of the waveform. 
More generally, if we consider speech below 4 kHz to be equivalent to four 
resonances, then 4 · 2 + 1 = 9 speech waveform samples suffice for a complete
specification of the free decay. Any additional samples are useful for the 
specification of the vocal cord excitation function, the effect (if any) of the 
nasal cavity, and the radiation characteristics of the lips. 

Now, if we insist on extracting the resonance frequencies and bandwidths 
from these waveform samples, we would be back where we started: the “in- 
tractable” formant-tracking problem. Instead, let us ask whether we cannot 
generate the speech spectrum from these waveform samples directly, without
the detour via the formant frequencies. The answer is a resounding “yes”.
The proper prescription is the aforementioned linear prediction, which,
together with the development of potent chips, has led to the great growth of
speech processing that we have witnessed during the past decade. 

Historically, the introduction of linear prediction to speech was triggered 
by a train of thought different from the one sketched above. The problem 
that the author and his collaborator B.S. Atal turned their attention to in
1966 was the following: for television pictures, the encoding of each picture
element (“pixel”) as if it were completely unpredictable is of course rather
wasteful, because adjacent pixels are correlated. Similarly, for voiced speech, 
each sample is known to be highly correlated with the corresponding sample 
that occurred one pitch period earlier. In addition, each sample is correlated 
with the immediately preceding samples, because the resonances of the vocal 
tract “ring” for a finite time that equals their reciprocal bandwidth (roughly 
10 ms). Therefore, at a sampling rate of 8 kHz, blocks of 80 samples show 
an appreciable correlation. However, since the number of “independent” pa- 
rameters in a speech signal is approximately 12 (two parameters for each 
resonance and a few extra parameters for the excitation function, nasal cou- 
pling and lip radiation) it suffices to consider the first 12 or so correlation 
coefficients and use only these in the prediction process. 

Pursuing this predictive philosophy then, and focusing on the short-time
correlations first, we want to approximate a current sample $s_n$ by a linear
combination of immediately preceding samples $s_{n-k}$:

$$ s_n = -a_1 s_{n-1} - a_2 s_{n-2} - \cdots - a_p s_{n-p} + e_n , \tag{5.1} $$

where $p$ is the “order” of the predictor and $e_n$ is the “prediction residual,”
that is, that part of $s_n$ that cannot be represented by the weighted sum of
$p$ previous samples. The weights $a_k$ in (5.1) are called the predictor
coefficients [5.28].

How should the weights $a_k$ be chosen? The analytically simplest method
is to minimize the squared prediction residual averaged over $N$ samples:

$$ E := \sum_{n=1}^{N} e_n^2 = \sum_{n=1}^{N} \left( \sum_{k=0}^{p} a_k s_{n-k} \right)^2 , \qquad a_0 = 1 . \tag{5.2} $$

A typical value for $N$ is 80, corresponding to a “time window” of 10 ms. Larger
values of $N$, implying summation over longer time intervals, would interfere
with the proper analysis of the perceptually important rapid transitions
between successive speech sounds. Much smaller values of $N$ would make the
analysis unnecessarily prone to noise and other disturbing influences.

Minimization of $E$ with respect to the $a_k$ means setting the partial
derivatives $\partial E / \partial a_m$ equal to zero:

$$ \frac{\partial E}{\partial a_m} = 2 \sum_{n=1}^{N} s_{n-m} \sum_{k=0}^{p} a_k s_{n-k} = 0 . \tag{5.3} $$

Inverting the order of the two summations in (5.3) yields:

$$ \sum_{k=0}^{p} r_{mk} a_k = 0 , \qquad m = 1, 2, \ldots, p , \tag{5.4} $$

where we have introduced the correlation coefficients:

$$ r_{mk} = \sum_{n=1}^{N} s_{n-m} s_{n-k} . \tag{5.5} $$

Having determined the correlation coefficients $r_{mk}$ from the speech signal,
we can use (5.4) to calculate the predictor coefficients $a_k$. Remembering that
$a_0 = 1$, we have:

$$ \sum_{k=1}^{p} r_{mk} a_k = -r_{m0} , \qquad m = 1, \ldots, p , \tag{5.6} $$

or in matrix notation:

$$ R a = -r_0 , \tag{5.7} $$

where $R$ is the $p \times p$ matrix with entries $r_{mk}$, and $a$ and $r_0$ are the
column vectors corresponding to $a_k$ and $r_{m0}$ in equation (5.6), respectively.
Equation (5.7) is solved by matrix inversion, giving the desired predictor
coefficients:

$$ a = -R^{-1} r_0 . \tag{5.8} $$

Of course, for $R^{-1}$ to exist, the rank of $R$ must be $p$.
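
The chain from the correlation coefficients (5.5) to the solution of (5.7) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in [5.22]: the sums run over the samples for which all $p$ predecessors are available within the record, the linear system is solved by ordinary Gaussian elimination rather than explicit inversion, and the function names are hypothetical. The test signal is a single freely decaying resonance, for which a second-order predictor should be exact.

```python
import math


def correlation_matrix(s, p):
    """r[m][k] = sum over n of s[n-m] * s[n-k] for m, k = 0..p, as in (5.5)."""
    return [[sum(s[n - m] * s[n - k] for n in range(p, len(s)))
             for k in range(p + 1)] for m in range(p + 1)]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def predictor_coefficients(s, p):
    """Solve R a = -r_0, equation (5.7), for a_1 .. a_p."""
    r = correlation_matrix(s, p)
    R = [[r[m][k] for k in range(1, p + 1)] for m in range(1, p + 1)]
    r0 = [r[m][0] for m in range(1, p + 1)]
    return solve(R, [-v for v in r0])


# Hypothetical test signal: one decaying resonance at 500 Hz, sampled at 8 kHz.
# Two leading samples plus N = 80 samples enter the sums.
rho, theta = 0.97, 2 * math.pi * 500 / 8000
s = [rho ** n * math.cos(theta * n) for n in range(82)]
a1, a2 = predictor_coefficients(s, 2)
```

With the sign convention of (5.1), the expected result is $a_1 = -2\rho\cos\theta$ and $a_2 = \rho^2$, where $\rho$ and $\theta$ are the radius and angle of the resonance pole.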

Numerous algorithms have been proposed to speed up the calculation of
$R^{-1}$. Some are based on the fact that $R$ becomes a Toeplitz matrix if we set

$$ r_{mk} = r_{|m-k|} . \tag{5.9} $$

This is a justified simplification because the correlation $r_{mk}$ depends mostly
on the relative delay $|m-k|$ between $s_{n-m}$ and $s_{n-k}$. Other algorithms
exploit the fact that the partial correlation coefficients [5.29] can be computed
by simple recursion, which reduces the total number of multiply-and-add
operations. These are very important savings when it comes to the design of
coders working in real-time.
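
One widely used recursion of this kind is the Levinson-Durbin algorithm, named here as an example; the text above does not say which recursion it has in mind. The sketch below assumes the Toeplitz form (5.9), with correlations that depend only on the lag, and the sign convention of (5.1). The sign of the partial correlation coefficients is a matter of convention, and the function names are hypothetical.

```python
def autocorrelation(s, p):
    """r[k] = sum over n of s[n] * s[n-k] for lags k = 0..p."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations by recursion on the order.

    Returns (a, parcor, err): the coefficients a[0..p] with a[0] = 1,
    one partial correlation coefficient per recursion step, and the
    remaining squared prediction error.
    """
    a = [1.0] + [0.0] * p
    err = r[0]
    parcor = []
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        parcor.append(k)
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, parcor, err


# Correlations that fall off as 0.5 to the power of the lag: a first-order
# predictor is then already optimal, so a_2 should vanish.
a, parcor, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

Each step of the recursion costs on the order of $p$ operations, so the whole solution needs on the order of $p^2$ multiply-and-add operations, compared with $p^3$ for general elimination.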

The partial correlations are also directly related to the reflection
coefficients of a lossless acoustic tube consisting of cylindrical sections of equal
length [5.30]. Thus, there is a one-to-one relationship between partial corre- 
lations and cross-sectional areas, i.e. the geometry, of such a tube. 

Figure 5.1 shows a speech signal (upper trace) and the prediction residuals 
after formant prediction (center trace, amplified by 10 dB). As expected, 
the short-time correlation is almost completely removed. But the long-time 
correlation, from pitch period to pitch period, still persists - as evidenced by 
the large spikes at pitch-period intervals. 

To remove this correlation, we have to find out for which delay in the pitch 
range (for example, 3-20 ms) the autocorrelation function of the prediction 
residual has a maximum and how high it is. Then we subtract the properly 
delayed and amplitude scaled prediction residual from itself. The result is 
the lower trace in Fig. 5.1, amplified by another 10 dB for better visibil- 
ity. The resulting residual, after both short-time (“formant”) and long-time 
(“pitch”) prediction is a highly unpredictable waveform of relatively small 
power. The signal power divided by the power of the prediction residual is 
called the prediction gain. In Fig. 5.1 the prediction gain is about 20 dB. 
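
The long-time prediction step described above can be sketched as follows. This is an illustration under simplifying assumptions rather than the coder of [5.22]: a single lag and a single gain factor are used for the whole record, and the first samples, which have no delayed counterpart, are passed through unchanged. The input is a hypothetical formant-prediction residual consisting of unit spikes every 64 samples, so the gain computed at the end refers to this pitch-prediction stage alone and not to the 20 dB quoted for Fig. 5.1.

```python
import math


def pitch_predict(e, lag_min, lag_max):
    """Subtract the delayed, amplitude-scaled residual from itself."""
    def corr(lag):
        return sum(e[n] * e[n - lag] for n in range(lag, len(e)))

    # Delay in the pitch range at which the autocorrelation is largest.
    lag = max(range(lag_min, lag_max + 1), key=corr)
    energy = sum(e[n - lag] ** 2 for n in range(lag, len(e)))
    beta = corr(lag) / energy if energy > 0 else 0.0
    d = [e[n] - (beta * e[n - lag] if n >= lag else 0.0)
         for n in range(len(e))]
    return lag, beta, d


def prediction_gain_db(signal, residual):
    """Signal power divided by residual power, in decibels."""
    num = sum(v * v for v in signal)
    den = sum(v * v for v in residual)
    return 10 * math.log10(num / den)


# Spikes every 64 samples (8 ms at 8 kHz); search range 3 to 20 ms.
e = [1.0 if n % 64 == 0 else 0.0 for n in range(512)]
lag, beta, d = pitch_predict(e, 24, 160)
gain = prediction_gain_db(e, d)
```

Only the first spike survives the subtraction, so eight spikes are reduced to one and the gain of this stage is 10 log 8, or about 9 dB.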







[Figure 5.1: three waveform traces, labeled SPEECH, PREDICTION RESIDUAL, and
PREDICTION RESIDUAL AFTER PITCH PREDICTION, plotted against TIME (msec) from 0 to 75]

Fig. 5.1. From the speech signal to the prediction residual. The top trace shows a 
100-ms excerpt from a male speech signal, the onset of a vowel. Note the increas- 
ing intensity and periodicity with a period of about 8.3 ms (120 Hz). - The center 
trace, amplified by 10 dB to show more detail, illustrates the effect of predicting 
the formant structure, which manifests itself in a “bumpy” spectral envelope and
short-delay correlations. The “LPC” residual resulting from linear prediction has a 
(nearly) flat spectral envelope, but a pronounced harmonic line structure. This spec- 
trally flattened signal is ideal for pitch detection purposes, because the periodicity 
due to the voice pitch stands out clearly, unencumbered by the formant structure 
(which has given classical pitch detection so much trouble). - The bottom trace, 
amplified by 20 dB relative to the top trace, shows the prediction residual after 
both short-delay (formant) and long-delay (pitch) prediction. This residual signal, 
although still showing traces of pitch and formant structure, has an essentially flat 
spectrum. Encoding it under a minimum mean-square-error criterion leads to more 
audible quantizing noise than would a spectrally weighted error-criterion. An appro- 
priately chosen subjective error criterion exploits the fact that the spectral regions 
of the quantizing noise that coincide with a formant are reduced in loudness or are 
even inaudible as a result of auditory masking (see Figs. 5.4 and 5.5) 




Again, note that this removal of redundancy from the original speech signal 
has been achieved by exploiting the formant and pitch structures of speech 
signals without explicitly measuring formant frequencies and pitch-periods. 
For example, if the delay for which the long-time correlation was a maximum 
was not equal to the pitch period, it would not matter at all. In fact, we 
do not want to measure the pitch period (a difficult task, just like precise 
formant tracking); we want to remove as much correlation as possible by the
prediction process. This robustness to measurement “error” is just one more
reason for the success of linear prediction! 

There is an interesting frequency-domain interpretation of minimizing the 
prediction error (for fixed gain): minimizing the prediction error in the time 
domain is equivalent to minimizing the average ratio of the speech energy 
spectrum to the estimated energy spectrum based on the all-pole model. 
The minimization is also equivalent to maximizing the spectral flatness of 
the prediction error, defined as the ratio of the geometric mean of the error 
energy spectrum to its arithmetic mean. For a flat spectrum this ratio is 1; for
highly nonflat spectra the ratio tends to 0 [5.31].
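
The flatness measure is easily computed. The following sketch assumes that the energy spectrum is given as a list of strictly positive values, one per frequency bin, and takes the mean in the denominator to be the ordinary (arithmetic) average; the function name is hypothetical.

```python
import math


def spectral_flatness(power_spectrum):
    """Geometric mean divided by arithmetic mean of a positive spectrum."""
    n = len(power_spectrum)
    geometric = math.exp(sum(math.log(v) for v in power_spectrum) / n)
    arithmetic = sum(power_spectrum) / n
    return geometric / arithmetic


flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])        # flat spectrum
peaky = spectral_flatness([1e-6, 1.0, 1e-6, 1e-6])    # one dominant line
```

The flat spectrum gives a ratio of 1, while the spectrum with a single dominant line gives a ratio close to 0.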



5.3.2 The Innovation Sequence 

From (5.1), it is obvious that we could synthesize the speech samples $s_n$
by feeding the prediction residual or “innovation sequence” $e_n$ (if it were
available) into a synthesis filter. Defining a prediction filter with the (discrete
and finite) impulse response $1, a_1, a_2, \ldots, a_p$ and adopting the $z$-transform
notation $s(z)$ for $s_n$, $e(z)$ for $e_n$, and

$$ A(z) = 1 + a_1 z^{-1} + \cdots + a_p z^{-p} \tag{5.10} $$

for the prediction filter, (5.1) can be written as:

$$ s(z) \cdot A(z) = e(z) , \tag{5.1a} $$

that is, the speech signal could be obtained from the innovation sequence
$e(z)$ by filtering it with $1/A(z)$:

$$ s(z) = \frac{e(z)}{A(z)} . \tag{5.11} $$

According to the fundamental theorem of algebra, the polynomial $A(z)$ has
precisely $p$ zeros and no poles. The filter $1/A(z)$ therefore has only poles and
is called an “all-pole filter”.

Now the circle is complete; we wanted to exploit the fact that speech
signals can be efficiently represented by poles - without actually having to
specify them. Equation (5.11) is the desired answer - the prediction filter
$A(z)$ represents these poles, a fact which is brought into evidence by writing
$A(z)$ as a product, according to the fundamental theorem of algebra

$$ A(z) = (1 - z_1/z)(1 - z_2/z) \cdots (1 - z_p/z) , \tag{5.12} $$

where $z_1, z_2, \ldots$ are the zeros of $A(z)$ or the poles of $1/A(z)$. The relationship
between the complex poles $z_k$ and the formant frequencies $\omega_k$ and bandwidths
$\Delta\omega_k$, in units of the sampling angular frequency, is the following:

$$ \omega_k = \mathrm{Im}\{\ln z_k\} , \tag{5.13} $$

$$ \Delta\omega_k = -2 \ln |z_k| , \tag{5.14} $$

where ln is the natural logarithm. Figure 5.2 shows a voiced speech waveform
segment (upper right) and the corresponding spectrum together with the spectral
envelope defined by the filter $1/A(z)$. Note the harmonics of the fundamental
frequency, characteristic of nearly-periodic voiced speech sounds and the good
approximation of the spectral envelope $1/A(z)$ (the continuous line “riding”
on the spectrum).

Fig. 5.2. A short segment of a voiced speech signal (upper right inset) is Fourier
transformed to yield this logarithmic power spectrum. Note the regular harmonic
structure of the spectrum at multiples of the fundamental frequency (ca. 120 Hz).
The dotted line is the spectral envelope computed from the frequency response of
the prediction filter. - Short-time spectra, such as those shown in Figs. 5.2 and 5.3,
have a high frequency resolution in spite of the shortness of the time window. This
feat is accomplished by performing the Fourier transform on the spectrally flattened
speech signal, thereby minimizing the frequency uncertainties (“splatter”) resulting
from short time windows
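
Relations (5.13) and (5.14) can be checked numerically for the simplest case, a single resonance described by a second-order prediction filter. The sketch below rests on two assumptions: that $\omega_k$ and $\Delta\omega_k$ are measured in radians per sample, so that multiplying by the sampling rate and dividing by $2\pi$ gives hertz, and that the pole in the upper half plane is the one of interest. A predictor of higher order would need a general polynomial root finder, which is not shown. The function name is hypothetical.

```python
import cmath
import math


def formant_from_quadratic(a1, a2, fs):
    """Formant frequency and bandwidth in Hz for A(z) = 1 + a1/z + a2/z^2."""
    # The zeros of A(z) are the roots of z^2 + a1*z + a2 = 0.
    root = (-a1 + cmath.sqrt(a1 * a1 - 4 * a2)) / 2
    omega = cmath.log(root).imag             # (5.13), radians per sample
    delta_omega = -2 * math.log(abs(root))   # (5.14), radians per sample
    scale = fs / (2 * math.pi)
    return omega * scale, delta_omega * scale


# Hypothetical resonance: pole radius 0.97, pole angle equivalent to 500 Hz.
rho, fs = 0.97, 8000
theta = 2 * math.pi * 500 / fs
freq, bandwidth = formant_from_quadratic(-2 * rho * math.cos(theta),
                                         rho * rho, fs)
```

For these values the pole angle gives back the 500 Hz resonance frequency, and the pole radius of 0.97 corresponds to a bandwidth of roughly 78 Hz.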

Figure 5.3 shows corresponding results for an unvoiced (fricative) speech 
sound. There is no periodicity in the time waveform (upper right) and no har- 
monic line structure in the spectrum. Although the spectral envelope may not 
have been produced by poles alone (there is evidence of spectral zeros), the 
approximation of the spectral envelope by $1/A(z)$ is still more than adequate.



5.3.3 Single Pulse Excitation 

In a linear predictive coder (LPC), the excitation is exactly as in the clas- 
sical frequency-channel vocoder. An excitation source, emitting either pitch 
pulses or random noise, is controlled by voiced-unvoiced decisions and pitch 
detection at the transmitter. However, the excitation signal does not drive a 
set of parallel frequency channels but rather the synthesis filter $1/A(z)$ whose
parameters are set by the predictive analysis at the receiver.

Fig. 5.3. A short segment of an unvoiced speech signal (upper right inset) and its
Fourier transform, obtained by the method described in the caption to Fig. 5.2. Note 
the absence of harmonic frequencies and the broad spectral peak, characteristic of 
unvoiced sounds. Noise-free coding of such hiss-like sounds is not critical, because 
“noise plus noise equals noise” and the ear is somewhat more tolerant to spectral 
distortions of fricative speech sounds 



Linear prediction represented a great leap forward in the analysis and 
synthesis of speech, but the quality of the LPC speech, although notice- 
ably improved over that of the channel vocoder, still had a buzzy twang to 
it (possibly because the sharp (“zero-phase”) pulses used in the excitation 
have too much phase coherence between their harmonic frequency compo- 
nents). To alleviate this problem two rather different approaches have been 
pursued: multipulse excitation and stochastic coding of the prediction
residual. While single-pulse excitation requires pitch detection, multipulse and
stochastic coding proceed without pitch detection, thereby bypassing one of
the most difficult problems of speech analysis.

The measurement of the voice fundamental frequency [5.32] must be very 
accurate because the human ear is highly sensitive to pitch errors. One
reason why precise pitch detection has been so difficult is the formant
structure of speech signals. For example, a low first-formant frequency will
sometimes be confused with the fundamental frequency (especially for female 
voices, which generally have a higher pitch). 

Two methods have been very successful in eliminating the bothersome
formant structure: cepstrum pitch detection [5.9] and spectrum flattening,
first practiced in the voice-excited vocoders (VEV) [5.33] and proposed for
pitch detection by M.M. Sondhi [5.34]. In the context of linear predictive
coding, spectrum flattening is particularly germane, because the prediction
error filter $A(z)$ is in fact a very effective spectrum flattener. How prominent
the fundamental frequency becomes after filtering with $A(z)$ is illustrated by
the center trace of Fig. 5.1, which shows prominent pitch pulses that should
certainly ease the task of almost any pitch detector. Interestingly, there is a
close connection between predictor coefficients and the cepstrum (see Sect.
10.14) which allows the computation of one from the other, either recursively
or directly, see Appendix B.



5.3.4 Multipulse Excitation 

Careful listening - and subsequent spectral analysis - has shown that replac- 
ing the excitation signal by a single pulse per pitch period, no matter how 
well positioned the pulse may be, produces audible distortion (the already 
noted twang). B.S. Atal [5.35] therefore suggested using more than one pulse, 
typically eight, per period and adjusting the individual pulse positions and 
amplitudes sequentially to minimize a “perceptual error,” that is, a spectrally 
weighted mean-square error. This technique results in a better speech quality, 
not only because the prediction residual is better approximated by several 
pulses per pitch period (instead of a single one), but also because the mul- 
tipulse algorithm does not require pitch detection. In fact, a predetermined 
number of excitation pulses are assigned to each time window regardless of 
the pitch. 
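
The sequential search can be sketched in a few lines. The version below is a simplification of the scheme in [5.35], not a reproduction of it: it minimizes the plain squared error instead of a spectrally weighted one, it takes the impulse response `h` of the synthesis filter as given and truncated to a few samples, and it does not readjust earlier pulses once later ones have been placed. All names are hypothetical.

```python
def multipulse(target, h, n_pulses):
    """Place pulses one at a time so that each reduces the squared error most."""
    residual = list(target)
    pulses = []
    for _ in range(n_pulses):
        best = None
        for m in range(len(residual)):
            span = range(m, min(m + len(h), len(residual)))
            cross = sum(residual[n] * h[n - m] for n in span)
            energy = sum(h[n - m] ** 2 for n in span)
            score = cross * cross / energy  # error reduction for a pulse at m
            if best is None or score > best[0]:
                best = (score, m, cross / energy)
        _, m, amp = best
        pulses.append((m, amp))
        # Remove the contribution of the new pulse from the target.
        for n in range(m, min(m + len(h), len(residual))):
            residual[n] -= amp * h[n - m]
    return pulses, residual


# Hypothetical frame of 16 samples built from two pulses passed through h.
h = [1.0, 0.5, 0.25]
target = [0.0] * 16
for pos, amp in [(3, 2.0), (10, -1.0)]:
    for i, v in enumerate(h):
        target[pos + i] += amp * v

pulses, residual = multipulse(target, h, 2)
```

In this example the search recovers both pulse positions and amplitudes and leaves no error. With overlapping pulse responses the sequential choice is in general not optimal, which is one reason practical coders reoptimize the amplitudes after the positions have been found.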

However, herein lies also the weakness of multipulse excitation: the fixed 
rectangular time window associated with multipulse synthesis causes some 
roughness in the output speech because of rapid non-pitch-synchronous vari- 
ations of the predictor and multipulse parameters. Thus, while multipulse 
excitation has led to further progress in speech coding, problems of speech 
quality persist. These shortcomings can be avoided by our next approach. 



5.3.5 Adaptive Predictive Coding 

Adaptive predictive coding (APC) of speech signals was in fact the original 
route taken when predictive analysis was first applied to speech in 1967 [5.22]. 
The idea was to avoid the pitch problem and associated difficulties of excita- 
tion altogether by transmitting a quantized version of the prediction residual 
itself to the receiver to drive the synthesis filter $1/A(z)$. Because of the
prediction gain [5.23] of APC, typically 20 dB or more for stationary voiced
speech sounds, a very rough quantization of the prediction residual could be
expected to suffice. In fact, in the earliest implementations of APC, a one-bit
quantizer gave respectable quality, corresponding in subjective quality to
5- or 6-bit PCM!

To this day, it is not entirely clear why the perceived quality of one-bit 
APC was that high. The author speculated then (and still believes) that 
the ear is more sensitive to spectral errors during stationary speech sounds 
- precisely when the prediction gain is high and the quantizing error
correspondingly low. By contrast, during rapid transitions between different speech
sounds, when the prediction gain is a fortiori small, the ear’s sensitivity to
spectral error is small, so that it does not notice the greater quantizing error 
during transitions. 

This seductive hypothesis would explain the success of APC, but it has
never been formally tested. It could be that the ear has available, in parallel,
spectral analysis channels with different amounts of frequency resolution
and that it uses wider channels (which would “gloss over” much spectral
detail) when analyzing fast transients, because narrow channels - according to
the Uncertainty Principle - lack the necessary time resolution to decode the
important temporal cues in speech sound transients.

The fact that the ear uses wider frequency channels when listening to 
speech (as opposed to slowly varying tones, for example) is nicely illustrated 
in reverberant rooms: the frequency response of such rooms fluctuates by 
40 dB or more [5.36], yet we are not aware of these fluctuations. As E.C. 
Wente, inventor of the condenser microphone, remarked when he first saw 
such a response, “How can we hear at all in rooms?” - considering that we
strive to keep frequency response irregularities in our transducers (micro- 
phones and loudspeakers) down to a few decibels. The answer in this case 
has been unambiguous: the frequency analysis of the ear, when listening to 
fast-changing signals, is so wideband that it does not resolve the peaks and 
valleys of the room response that are very narrowly spaced. (The average 
spacing is 4/T, or about 4 Hz for a reverberation time T of 1 s [5.36].) 

5.3.6 Masking of Quantizing Noise 

We have just encountered one of the important subjective viewpoints, namely 
the ear’s frequency resolution, that hold the key to efficient high-quality 
speech coding. Another important observation is that the quantizing noise 
in APC (or any quantizing scheme) is “masked,” that is, made inaudible, 
or at least reduced in loudness, by the simultaneous presence of the speech 
signal itself. Auditory masking is a pervasive phenomenon of hearing. One 
of the latest manifestations is the walkman-cum-earphones equipped jogger 
(cyclist or mere pedestrian) who is completely oblivious to the acoustic traffic 
surrounding him - including such potentially life-saving sounds as that of an 
approaching truck or a simple honk of a car horn. 

A basic psychoacoustic observation is that any spectral prominence (such 
as a speech formant) will mask less intense sounds in its immediate frequency 
neighborhood. These neighborhoods are called “critical bands” (about 100 
Hz below 600 Hz and one sixth of the center frequency above 600 Hz). In 
addition, there is considerable masking or loudness reduction at frequencies 
higher than that of the masker. This frequency asymmetry of masking has a 
simple physiological explanation: sounds are propagated in the inner ear by 
traveling waves that go from the places where high frequencies are detected 
to those for the lower frequencies. Thus, the detectors (“hair cells”) for high 
frequencies “see” both high and low frequencies, while the low-frequency 
detectors see only low frequencies. Hence, low frequency components can
interfere with (mask) higher frequencies, but not the other way around. 
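
The rule of thumb quoted above for the width of a critical band can be written as a small function. This is only a sketch of that two-piece approximation (about 100 Hz below 600 Hz, one sixth of the center frequency above); the function name is an illustrative choice and not a standard routine.

```python
def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth (Hz) from the rule of thumb in the text."""
    if center_hz <= 0:
        raise ValueError("center frequency must be positive")
    # Roughly constant width of about 100 Hz at low frequencies ...
    if center_hz < 600.0:
        return 100.0
    # ... and proportional to the center frequency from 600 Hz upward.
    return center_hz / 6.0


for f in (200.0, 600.0, 1200.0, 3000.0):
    print(f"{f:6.0f} Hz -> about {critical_bandwidth_hz(f):5.0f} Hz wide")
```

The two branches meet at 600 Hz, where one sixth of the center frequency is exactly 100 Hz, so the approximation is continuous.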

While much data pertaining to the masking of tones and speech by noise 
has accumulated in the literature, there is relatively little known about the 
reverse masking situation: the masking of noise, such as quantizing noise, by 
tones or speech. To fill this void, J.L. Hall and the author have performed 
numerous masking experiments of noise by tones [5.24]. While some aspects 
of masking, such as the shapes and widths of the critical bands were found 
to be similar in the two masking situations, astounding differences were also 
discovered. While a tone masked by noise becomes inaudible when its level 
falls more than 4-6 dB below the noise level in the tone’s critical band, a 
noise, to become inaudible in the presence of a masking tone, must fall from
20 to 30 dB below the tone level. 

Why this asymmetry? The reason becomes clear through introspection 
when listening to a noise just above its masked threshold, that is, when the 
noise is barely audible. At levels that far below the tone level, the noise does 
not even sound like an additive noise. Rather the tone, which sounds clean and 
pure without the noise, sounds a little rough or distorted - and this roughness 
is just the kind of distortion caused by quantizing noise that we want to 
avoid in APC and other digital representations of speech signals. Luckily, 
multitone signals and speech are not quite as sensitive to distortion by noise. 
However, the spectral shapes associated with the masking phenomena are in 
all cases similar and these spectral shapes should enter the design of speech 
coders in the form of weighting functions for the quantizing noise [5.25]. 
More specifically, in optimizing the subjective quality of synthetic speech, 
one should minimize the spectrally weighted quantizing noise power that 
appears in the output signal, as illustrated in Figs. 5.4 and 5.5. 

5.3.7 Instantaneous Quantizing Versus Block Coding 

Instantaneous quantizers have two grave disadvantages: the achievable signal- 
to-noise ratio (SNR) is below optimum and the quantizing noise spectrum 
is difficult to control at the desired low bit-rates. An optimum one-bit
instantaneous quantizer for a memoryless Gaussian source, for example, gives
an SNR of 4.4 dB, while rate-distortion theory tells us that 6 dB can be
approached when coding long blocks of samples together [5.31].
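
The two figures just quoted can be checked with a few lines of arithmetic. The sketch below assumes a unit-variance Gaussian source, for which the optimum two-level quantizer places its output levels at plus and minus the mean absolute value of the source, and it uses the Gaussian rate-distortion function D(R) = 2^(-2R); the variable names are illustrative.

```python
import math

# Unit-variance, memoryless Gaussian source.
# Optimum one-bit (two-level) quantizer: output levels at +/- E|x| = +/- sqrt(2/pi).
level = math.sqrt(2.0 / math.pi)

# Mean-square error: E[(|x| - level)^2] = 1 - level^2 = 1 - 2/pi.
distortion_1bit = 1.0 - level ** 2
snr_1bit_db = 10.0 * math.log10(1.0 / distortion_1bit)

# Rate-distortion bound for a unit-variance Gaussian source: D(R) = 2^(-2R).
rate_bits = 1.0
distortion_bound = 2.0 ** (-2.0 * rate_bits)
snr_bound_db = 10.0 * math.log10(1.0 / distortion_bound)

print(f"one-bit instantaneous quantizer: {snr_1bit_db:.2f} dB")
print(f"rate-distortion bound at 1 bit : {snr_bound_db:.2f} dB")
```

The gap of roughly 1.6 dB is what block coding can, in the limit of long blocks, recover.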

Fig. 5.4. The spectral envelope of a vowel sound and that of the (idealized)
flat quantizing noise resulting from minimizing the mean-square error. The
noise would be audible where its level exceeds a level about 3 dB below the
spectral envelope

Fig. 5.5. The optimum noise shape (dashed line) for the vowel envelope
spectrum shown (solid line). This shaped noise spectrum results from
minimizing the subjective loudness - not the physical power - of the
quantizing noise. This strategy, based on human auditory perception, is
equivalent to a spectral weighting criterion. In the example shown here, the
quantizing noise, although only a few decibels below the speech spectral
envelope at most frequencies, is nearly inaudible

The 6 dB limit can be approached even though the samples are independent!
This is, in fact, one of the main lessons of Shannon’s information theory: its
promises (of error-free transmission at rates below the channel capacity, for
example) come true only in the limit of large sample numbers. (Shannon’s
famous discovery is related to what mathematicians now call ultrametricity.
In an ultrametric space, such as formed by long blocks of signal samples, the
usual triangle inequality a ≤ b + c, where a, b, and c are the sides of a
triangle, is replaced by the ultrametric inequality a ≤ max{b, c}, which
permits only an isosceles triangle whose base does not exceed its sides in
length. Specifically, the three Euclidean distances between three normalized
Gaussian vectors of N dimensions, N ≫ 1, are all approximately equal, giving
rise to an equilateral triangle in multidimensional space.)
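
The near-equality of the three distances between Gaussian vectors in many dimensions is easy to observe numerically. The following sketch draws three random vectors with N = 10000 components (an arbitrary "large" value chosen for illustration), scales them so that their expected squared length is 1, and prints the three mutual distances; the helper names are assumptions of this example.

```python
import math
import random

random.seed(1)  # fixed seed so that the run is repeatable

N = 10_000  # number of dimensions, an arbitrary "large" value


def normalized_gaussian_vector(n):
    """Gaussian vector scaled so that its expected squared length is 1."""
    scale = 1.0 / math.sqrt(n)
    return [random.gauss(0.0, 1.0) * scale for _ in range(n)]


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


x, y, z = (normalized_gaussian_vector(N) for _ in range(3))
sides = [distance(x, y), distance(y, z), distance(x, z)]
print([round(s, 3) for s in sides])  # three nearly equal sides
```

Each side comes out close to the square root of 2, because the squared distance between two independent unit-length Gaussian vectors is a sum of N small independent terms whose relative fluctuation shrinks as N grows.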

Thus, we are forced to give up, for two compelling reasons, the convenience
of quantizing the prediction residual sample by sample. Instead, we must
consider many samples together - blocks of samples - and select one
codeword from a given codebook to represent each block of samples. This
“codebook” coding replaces the inappropriate instantaneous quantizing by 
vector quantization. 

The difficulty of this approach lies in the searching of voluminous
codebooks. For example, the coding of blocks of 80 samples at just 1 bit/sample
requires a codebook containing 2^80 ≈ 10^24 different codewords! Obviously,
such astronomical codebooks can never be searched completely - in fact,
they cannot even be written down.

One is therefore forced to adopt incomplete search methods, hoping that
the best codeword found in this manner will not be too bad. One promising
approach is tree coding, in which never more than a given manageable number
of alternatives is kept open during a sequential search procedure through the
tree (see [5.37]). In this manner, searching trees of height 80 while keeping
up to 64 different paths open at any one time during the search, it has been
possible to represent the prediction residual by one or even as little as
one-half bit per sample (40 bits for a block of length 80) with excellent
quality (thanks, no doubt, also to the proper error weighting) [5.26].
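
The idea of keeping only a limited number of paths open can be sketched as follows. This is a toy illustration, not the coder of [5.26]: it assumes a binary tree (1 bit/sample), an excitation of plus or minus a fixed step, a one-pole synthesis filter y[n] = a*y[n-1] + e[n], and a plain (unweighted) squared error. Searches of this kind are often called the M-algorithm in the coding literature; all names and parameter values below are hypothetical.

```python
import itertools


def synthesize(bits, step=1.0, a=0.9):
    """Drive a one-pole filter y[n] = a*y[n-1] + e[n] with e[n] = +/-step."""
    y, out = 0.0, []
    for b in bits:
        y = a * y + (step if b else -step)
        out.append(y)
    return out


def tree_search(target, max_paths, step=1.0, a=0.9):
    """Breadth-first tree search that keeps at most max_paths survivors."""
    # Each survivor is (accumulated squared error, filter state, bits so far).
    survivors = [(0.0, 0.0, ())]
    for x in target:
        candidates = []
        for err, y, bits in survivors:
            for b in (0, 1):  # two branches per node = 1 bit/sample
                y_new = a * y + (step if b else -step)
                candidates.append((err + (x - y_new) ** 2, y_new, bits + (b,)))
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:max_paths]  # prune all but the best paths
    best_err, _, best_bits = survivors[0]
    return best_bits, best_err


def exhaustive_search(target, step=1.0, a=0.9):
    """Reference: try every one of the 2^n bit sequences (small n only)."""
    best = None
    for bits in itertools.product((0, 1), repeat=len(target)):
        y = synthesize(bits, step, a)
        err = sum((t - v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[1]:
            best = (bits, err)
    return best


target = [0.3, 1.5, 2.0, 0.4, -1.2, -2.5, -1.0, 0.8]
print(tree_search(target, max_paths=1))  # greedy: a single path kept open
print(tree_search(target, max_paths=4))  # a few paths kept open
print(exhaustive_search(target))         # feasible only for short blocks
```

The work per sample grows with the number of surviving paths rather than with the size of the full tree, which is what makes blocks of length 80 tractable.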

5.3.8 Delays 

Another problem with the coding of long blocks of data is the extra delay
it engenders. While such delays are quite tolerable in one-way transmissions,
they may play havoc in real-time two-way communications such as telephone
conversations. Geostationary communication satellites, for example,
introduce a round-trip delay of 600 ms, which is quite annoying to the talkers
and even disruptive, especially when the parties involved are “in a hurry” or
impatient. (Both speakers may start speaking at the same time, then - when
they hear the other side speaking - stop, then start speaking again, etc.)
Whereas the telephone equipment may not break down, tempers sometimes do.*

* There is an easy way to check whether one is connected by a high-altitude,
long-delay satellite or a fast transoceanic cable: the first party says “one”
and the other side responds with “two” as soon as it hears the “one,” then the
first party responds with “three,” etc. If the result is “one (pause) two
(pause) three” etc., there surely is a long delay lurking in the link. - The
author once suggested establishing a research department at Bell to work on
increasing the speed of light but nobody in higher management took him up on
this “impossible” idea. (Actually it might be done - in a sufficiently intense
anti-gravitational field and the ensuing high space-time curvature.)

In the early days of digital simulations and digital signal processing, people
were also afraid of extra delays due to finite processing speeds (stemming, for
example, from the complex algorithms to incorporate subjective error criteria
in the coder). But the author, not knowing much about hardware, always
thought that problems due to high processing complexity would simply go
away sooner or later - which they did, sooner rather than later.

5.3.9 Code Excited Linear Prediction (CELP) 

One of the results of the block coding by trees was a demonstration of how 
good speech can sound even at 0.5 bits/sample for the prediction residual. 
Might not 0.25 bits/sample, when used with a spectrally weighted error
criterion, still give high quality? At such a low bit rate, there is even a chance
for an exhaustive codebook search. Suppose we make the processing time 
window as short as 5 ms. At a sampling rate of 8 kHz, this corresponds to a 
block length of 40 samples for the prediction residual. If we want to transmit 
this residual information at a rate of 0.25 bits/sample (corresponding to 
2000 b/s), then 10 bits are available for each frame. In other words, we could 
represent each block of 40 samples by one out of 2^10 = 1024 equally likely
codewords. 

The simplest way to do this is to work with a random codebook whose 
individual samples are independently Gaussian distributed (because the pre- 
diction residual is closely approximated by a white Gaussian process). Such 
a system has been successfully simulated on a large computer [5.38]. Speech 
quality even at this low bit rate was so high that still lower bit rates might 
be possible - always observing proper subjective error criteria! 
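
The bit budget and the exhaustive search described above can be sketched in a few lines. The example below is a deliberately simplified illustration: it matches a block of residual samples directly against a random Gaussian codebook using a least-squares gain and a plain squared error, whereas a real coder would compare signals after the synthesis filter and with a subjective error weighting. The gain is assumed to be transmitted separately, and all names are hypothetical.

```python
import random

random.seed(0)

FRAME_MS = 5
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 0.25

block_len = SAMPLE_RATE_HZ * FRAME_MS // 1000      # samples per block
bits_per_frame = int(block_len * BITS_PER_SAMPLE)  # bits for the codeword index
codebook_size = 2 ** bits_per_frame                # number of codewords
bit_rate = bits_per_frame * 1000 // FRAME_MS       # bits per second

# Random codebook whose samples are independently Gaussian distributed.
codebook = [[random.gauss(0.0, 1.0) for _ in range(block_len)]
            for _ in range(codebook_size)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def search(target):
    """Exhaustive search: best codeword index, its gain, and the squared error."""
    best = None
    for index, code in enumerate(codebook):
        gain = dot(target, code) / dot(code, code)  # least-squares gain
        err = dot(target, target) - gain * dot(target, code)
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


residual = [random.gauss(0.0, 1.0) for _ in range(block_len)]
index, gain, err = search(residual)
print(block_len, bits_per_frame, codebook_size, bit_rate)
print(index, round(gain, 3), round(err, 3))
```

With only 1024 entries of 40 samples each, every codeword can be tried for every frame, which is the point made in the text.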

5.3.10 Algebraic Codes 

The only stumbling block to large-scale hardware realization of stochastic 
coding systems is the time it takes to search huge codebooks for the
best fit. What we need are codes for which fast search algorithms, amenable 
to complex error criteria, exist. This is one area where present efforts toward 
high-quality speech coding at very low bit rates are focused. 

Fast search algorithms imply of course that useful codebooks are not
random but possess sufficient internal structure of a geometric or algebraic
nature. This structure can then be exploited to “factorize” the N-dimensional
search space into smaller, more easily digestible, chunks. The attendant
increase in computational speed is of order N/log_b N, where the base b
reflects the degree of factorization. Ideally, b = 2, as in the Fast Fourier and
Hadamard Transforms (FFT and FHT, respectively). One effort in this
direction is a new code based on the FHT and permutation codes [5.39].
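
As an illustration of such a factorization with b = 2, here is a minimal sketch of the fast Hadamard transform in its natural (Sylvester) ordering. It is meant only to show where the N log2 N operation count comes from; it is not the code construction of [5.39], and the function name is an illustrative choice.

```python
def fast_hadamard_transform(values):
    """Unnormalized fast Hadamard transform; the length must be a power of 2."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = list(values)
    half = 1
    while half < n:  # log2(n) stages ...
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):  # ... of n/2 butterflies each
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out


print(fast_hadamard_transform([1, 0, 0, 0]))  # a unit pulse spreads evenly
print(fast_hadamard_transform([1, 1, 1, 1]))  # a constant collapses to one term
```

Each stage uses only additions and subtractions, so the direct N x N matrix product with its N^2 operations is replaced by N log2 N of them. Applying the transform twice returns N times the input.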

5.3.11 Efficient Coding of Parameters

For the first 15 years of adaptive predictive coding the parameter efficiency 
did not matter because the innovation sequence “ate up” 10 kb/s or more. 
But now, with transmission rates as low as 2 kb/s for the prediction residual 
even for very high quality, the efficient encoding of the “slowly varying” 
parameters - predictor coefficients, gain, and pitch-loop gains and delays -
becomes important. 

As is well known, the different parameters have different energies and dif- 
fer in their subjective importance. Thus, transformations in parameter space, 
such as Karhunen-Loève and singular-value decomposition, are called for.
Together with “dynamic bit allocation,” which allots more bits per frame to
those components in transformed parameter space that have larger eigenval- 
ues, considerable reductions in bit-rate have been achieved [5.40]. However, as 
in other instances of dynamic bit allocation (transform coding, for example), 
the results fell short of expectations because constantly varying bit assign- 
ments (including the “zero-bits” case) can introduce noticeable spectral and 
temporal discontinuity into the speech signal. 



5.4 Waveform Coding 

For continuous analog signals (such as speech, music, or video) to be transmit- 
ted digitally, the signal has to be made discrete in both time and amplitude. 
If the signal contains frequencies only below some cut-off frequency f_c, then
- as Harry Nyquist has taught us - 2f_c samples per second suffice for a
complete reconstruction of the signal. The basic method of converting a stream of
analog signal samples into digital form is by pulse code modulation (PCM). 
In PCM each signal amplitude is converted into a digital, often binary, code. 
In principle, PCM signals can be transmitted over arbitrary distances with- 
out cumulative degradation by noise if they are “regenerated” at appropriate 
intervals [5.4]. For telephone speech signals sampled at 8 kHz, 8 bits per sam- 
ple suffice (especially if amplitude compression and non-uniform quantizing 
levels are used). The resulting “benchmark” bit rate is therefore 64 kbits per 
second. 

The optimum design of the quantizer to minimize a given error criterion 
depends on the signal statistics. Rules for such quantizers were given by J. 
Max and S.P. Lloyd [5.41]. The total signal range is sectioned into a
preselected number of contiguous signal ranges (“bins”), each represented by one
quantizing level. The borders between two adjacent bins lie halfway between 
the respective quantizing levels. The quantizing levels themselves (for mini- 
mizing r.m.s. error) are the weighted mean of the signal distribution for each 
bin. If the absolute error is to be minimized, the means are replaced by bin
medians.

For multi-dimensional quantizers the quantizing regions are called Voronoi
cells, defined such that each point inside a given cell is closer to its
quantized value than to any other quantized value. For minimizing the r.m.s.
quantizing error, the quantizing values are again the centers of gravity of
the individual cells.
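
The two rules for the scalar quantizer (borders halfway between levels, levels at the mean of each bin) suggest a simple alternating iteration. The sketch below applies it to random samples instead of an analytic distribution; the starting levels, the number of iterations, and the function name are assumptions made for this illustration.

```python
import random


def lloyd_max(samples, num_levels, iterations=50):
    """Design a scalar quantizer for minimum mean-square error from samples."""
    data = sorted(samples)
    # Start with levels spread evenly over the sorted data.
    levels = [data[(2 * k + 1) * len(data) // (2 * num_levels)]
              for k in range(num_levels)]
    for _ in range(iterations):
        # Bin borders lie halfway between adjacent quantizing levels.
        borders = [(lo + hi) / 2.0 for lo, hi in zip(levels, levels[1:])]
        # Sort the samples into bins; the data are ordered, so k only grows.
        bins = [[] for _ in levels]
        k = 0
        for x in data:
            while k < len(borders) and x > borders[k]:
                k += 1
            bins[k].append(x)
        # Each level moves to the mean (center of gravity) of its bin.
        levels = [sum(b) / len(b) if b else old
                  for b, old in zip(bins, levels)]
    return levels


random.seed(2)
gaussian = [random.gauss(0.0, 1.0) for _ in range(20_000)]
print([round(v, 3) for v in lloyd_max(gaussian, 2)])  # one-bit quantizer
print([round(v, 3) for v in lloyd_max(gaussian, 4)])  # two-bit quantizer
```

For the one-bit case the levels settle near plus and minus the mean absolute value of a unit-variance Gaussian variable, about 0.8. The same alternation between "nearest level" and "center of gravity" carries over to Voronoi cells in several dimensions.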

Pulse code modulators used to be quite complex. In the search for simpler
coders, de Jager and Greefkes hit upon the idea of delta modulation (ΔMod),
in which only a positive or negative pulse of fixed size is transmitted at each
sample time [5.42]. Integration at the receiver restores an approximation to
the original waveform. While ΔMod is much simpler to implement than PCM,
it is not particularly conservative of bandwidth or bit rate because of the high
sampling rate (“oversampling”) required.
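
A minimal sketch of delta modulation with a fixed step size follows. The test signal, the step size, and the function names are illustrative assumptions; the point is only that one bit per sample suffices as long as the signal changes by less than one step between samples, which is why heavy oversampling is needed.

```python
import math


def delta_modulate(signal, step):
    """One bit per sample: 1 if the signal lies at or above the running estimate."""
    estimate, bits = 0.0, []
    for x in signal:
        bit = 1 if x >= estimate else 0
        estimate += step if bit else -step  # the encoder mirrors the receiver's integrator
        bits.append(bit)
    return bits


def delta_demodulate(bits, step):
    """Integrate the +/-step pulses to approximate the original waveform."""
    estimate, out = 0.0, []
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


# A heavily oversampled sine wave: 200 samples per period.
signal = [math.sin(2.0 * math.pi * n / 200.0) for n in range(400)]
step = 0.05
decoded = delta_demodulate(delta_modulate(signal, step), step)
worst = max(abs(x - y) for x, y in zip(signal, decoded))
print(f"largest error: {worst:.3f} (step size {step})")
```

Because the sine wave here never changes by more than about 0.031 per sample, the reconstruction stays within one step of the input. A faster signal or a smaller step would produce slope overload, in which the integrator cannot keep up.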

In 1950 C. C. Cutler, in connection with predictive picture coding, in- 
vented Differential Pulse Code Modulation (DPCM), a kind of multi-bit 
delta modulation [5.43]. In 1970 N. S. Jayant demonstrated the now widely 
used adaptive delta modulation (ADM) in which the quantizing steps were 
adapted to accommodate time-varying properties of speech signals [5.44].
Later D. J. Goodman and J. L. Flanagan demonstrated direct conversion 
between PCM, delta modulation and adaptive delta modulation [5.45]. In a 
further advance, Flanagan, in 1973, suggested adaptively quantized differen- 
tial PCM (ADPCM) and, with Jayant and P. Cummiskey, demonstrated the 
excellent speech performance at bit rates as low as 24 kb/s [5.46]. 

Subband coding and wavelets have further enhanced the possibilities of 
waveform coding [5.47]. 

Although requiring higher bit rates than parametric compressors, wave- 
form coding, combined with proper subjective error criteria, is still much 
in the running for high-quality audio coding for high-definition television 
(HDTV), and for motion-picture industry standards (MPEG). But for the 
highest compression factors, parametric compressors, especially vocoders 
based on linear prediction, reign supreme. 



5.5 Transform Coding 

In transform coding, the signal is segmented into finite-length chunks that 
are then subjected to a, usually linear, transformation, such as the Fourier 
or Hadamard transformations. Of particular interest is the so-called discrete 
cosine transform (DCT) which is much like the real part of a Fourier trans- 
form [5.48]. But because the sine terms are missing, the DCT implies a sym- 
metric input. As an example, the DCT of the (symmetric!) energy spectrum 
of a real signal is identical to the autocorrelation of that signal. 
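
The closing remark about the energy spectrum and the autocorrelation can be verified numerically. The sketch below uses a plain cosine sum over the discrete Fourier energy spectrum (not one of the specific DCT variants used in coders) and compares it with the circular autocorrelation computed directly; the test signal and the function names are assumptions of this example.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform, adequate for short sequences."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_autocorrelation(x):
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]


def cosine_transform_of_energy_spectrum(x):
    n = len(x)
    energy = [abs(c) ** 2 for c in dft(x)]  # real and symmetric in k
    # Only cosine terms are needed because the energy spectrum is symmetric.
    return [sum(energy[k] * math.cos(2.0 * math.pi * k * lag / n)
                for k in range(n)) / n
            for lag in range(n)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
direct = circular_autocorrelation(x)
via_spectrum = cosine_transform_of_energy_spectrum(x)
print([round(v, 6) for v in direct])
print([round(v, 6) for v in via_spectrum])
```

The two printed lists agree to within rounding error. The sine terms drop out because the energy spectrum of a real signal has the same value at frequency index k and at N - k.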

One of the main advantages of the DCT in speech compression is that 
the transform coefficients are not all of equal perceptual importance. Sizable 
compression factors can therefore often be realized by dynamic bit-allocation 
in which the less important channels receive relatively few (or no) quantizing 
bits. The same principle of dynamic bit-allocation is also exploited in wavelet
and subband coding, of which the DCT can be considered a special case. 

The DCT has recently found extensive application in image quantizing. It 
has become part of the JPEG (Joint Photographic Experts Group) standard
for the Internet, in which each 8x8 block of pixels of an image is subjected 
to a DCT. 

One of the most advanced methods of transform coding is based on 
the Prometheus orthonormal set, which, like the Hadamard and Walsh 
transforms, uses multiplication by +1 or −1 only [5.49]. In addition, the
Prometheus orthonormal set, which is derived from the Rudin-Shapiro poly- 
nomials [5.50], has ideal energy spreading properties [5.51]. 



5.6 Audio Compression 

Much of speech compression is based on source coding. This means that the 
compression strategy is based on the characteristics of the human vocal ap- 
paratus and the constraints it imposes on possible speech signals. A prime 
example is linear predictive coding (LPC) which is based on the fact that 
the spectra of many speech sounds are governed by poles (resonances of the 
vocal tract). 

However, even some dark-age speech coders, like the channel vocoder,
incorporate properties of human hearing: the channel vocoder preserves only 
the amplitude spectrum of speech sounds and discards the phases - which, 
perceptually, are not as significant as the amplitudes. While the human ear is 
certainly not “phase deaf,” monaural phase sensitivity is limited. Coders that 
exploit such limitations of human perception are called perceptual coders. 

Another limitation of human hearing that looms large in modern audio 
coders is auditory masking, i.e., the inability to perceive those frequency 
components that are close to stronger frequency components. The latter are 
then said to mask the weaker ones. Thus, information pertaining to certain
weak frequency components is irrelevant. In efficient perceptual audio coders
(PAC) this information is suppressed [5.26]. See also the overview by A.
Gersho [5.52]. 

But no matter how much progress in speech compression the future will 
bring, books on paper, it seems, are here to stay, see Fig. 4.6. 

The word ‘meaningful’ when used today is nearly always meaningless.

Paul Johnson (1982) 



It depends on what the meaning of the word ‘is’ is. 

William Jefferson Clinton (1998) 



Our task now is not to fix the blame for the past, 
but to fix the course for the future. 

John Fitzgerald Kennedy (1917-1963) 



Speech synthesis from written text has been a long-standing goal of engineers 
and linguists alike. One of the early incentives for “talking machines” came 
from the desire to permit the blind to “read” books and newspapers. True, 
tape-recorded texts give the blind access to some books, magazines and other 
written information, but tape recordings are often not available. Here a read- 
ing machine might come in handy, a machine that could transform letters on 
the printed page into intelligible speech. Scanning a page and optically recog- 
nizing the printed characters is no longer a big problem - witness the plethora 
of optical scanners available in computer stores today. The real problem is 
the conversion of strings of letters, the graphemes, to phonetic symbols and
finally the properly concatenated sequence of speech sounds [6.1]. 

In speech synthesis from written material, one of the first steps is usu- 
ally the identification of whole words in the text and their pronunciation, 
as given by a string of phonetic symbols. But this string is only a “guide” 
to pronouncing the word in isolation and not as embedded in a meaningful 
grammatical sentence. Speech is decidedly not, as had long been innocently 
assumed, a succession of separate speech sounds strung together like a string 
of pearls on a necklace. Rather, the ultimate pronunciation is determined
by the syntactical function of the word within its sentence and the meaning 
of the text. This meaning can often be inferred only by inspecting several 
sentences. Thus, proper speech synthesis from general texts requires lexi- 
cographical, syntactical, and semantic analyses. These prerequisites are the 
same as for automatic translation from one language to another, and they are 
one reason why translation by machines remains difficult. (Another reason 
of course is that some utterances in one language are literally untranslat- 
able into certain other languages.) Not surprisingly, good, natural-sounding 
automatic speech synthesis from unrestricted texts is anything but easy! 

Beyond reading machines for the blind, there is an ever-increasing need to 
convert text, be it on the printed page or in computer memory, into audible 
form as speech. With the spreading Internet, a huge store of information is 
only a mouse click away for ever more people. While much of this informa- 
tion is best absorbed by looking at a printed document, in many cases an 
oral readout would be preferable: think of a driver in a moving car, the sur- 
geon bent over the operating table or any other operator of machinery who 
has his hands and eyes already fully occupied by other tasks. Or think of 
receiving text information over cable or over the air (by mobile phone, say). 
In such cases a voice output of text would be a good option to have. This 
is particularly true for people on the go who could receive their text email 
by listening to the output of a text-to-speech synthesizer. Such voice email 
would obviate the need of lugging a portable printer around the country (or 
the world). - Finally, many people on our globe cannot read; they have to 
rely on pictorial information or the spoken word. 

Still other applications of speech synthesis from text result from the great 
bit compression it permits. Waveform and parameter coding of speech signals 
allow compression down to a few thousand bits per second. By contrast, the 
corresponding written text, albeit lacking intonation, requires only a hundred 
bits per second at normal read-out rates and even less with proper entropy 
coding. In fact, Shannon, in an ingenious experiment, estimated that the
entropy of printed English is but 2.3 bits per letter - one half of the entropy
(4.7 bits/letter) if all 26 letters and the space between letters were equiprob- 
able and independent of each other [6.2]. Thus text-to-speech synthesis (in 
connection with automatic speech recognition) would allow the ultimate in 
bit compression. 
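
The entropy figures quoted above translate into bit rates as follows. The read-out rate of 15 letters per second is an assumption made purely for illustration; the entropy of 2.3 bits per letter is the estimate cited in the text, and log2(27) is the value for 27 equiprobable, independent symbols (about 4.75, which the text rounds to 4.7).

```python
import math

# 26 letters plus the space, all equiprobable and independent.
max_entropy = math.log2(27)   # bits per letter
shannon_estimate = 2.3        # bits per letter, the value quoted in the text

# Assumed read-out rate, for illustration only.
letters_per_second = 15.0

print(f"equiprobable symbols: {max_entropy:.2f} bits/letter")
print(f"ratio to estimate   : {shannon_estimate / max_entropy:.2f}")
print(f"plain coding        : {max_entropy * letters_per_second:.0f} b/s")
print(f"entropy coding      : {shannon_estimate * letters_per_second:.0f} b/s")
```

Under these assumptions the text needs well under a hundred bits per second, compared with a few thousand bits per second for waveform or parameter coding of the corresponding speech.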

Synthesis of natural speech from unrestricted text also requires proper 
prosody: word and sentence intonation, segment durations and stress pattern. 
All three aspects of prosody have inherent (“default”) values, which govern 
the word when spoken in isolation. But necessary modifications from these 
standards depend on the structure of the sentence and, again, the intended 
meaning and mode of speaking: Is the utterance a question, an order, a
neutral statement or what? [6.3].

A related aspect of human speech is its “style”: Is the speaker shouting
or preaching? Is he reading from a newspaper or a detective novel? How fast 
is he or she speaking? Does the speaker feel anxiety? How confident is he? 

A person can produce and recognize the intonation and type of voice em- 
ployed in coaxing, in pleading, in browbeating, and in threatening, in plea- 
sure, and in anger, as well as those appropriate for matter-of-fact statements. 
This is one of the areas of speech about which little is currently known. 

All these different styles affect not only the prosody but reach into the 
articulatory domain and influence the course of the formant frequencies. (For
example, for fast speech, vowels tend to be “neutralized,” i.e. the formant
frequencies migrate to those of the uniform-area vocal tract.) There are also
many interesting interactions. The pause structure, for instance, influences
the intonation. The beginning of a talk sounds subtly different from its end- 
ing. (This writer, apparently on the basis of such subtle linguistic cues, is 
almost always aware - he may even wake up in time to applaud - when the 
end of a lecture is near.) 

A person’s speech, supplemented by facial expression and gesture, indi- 
cates a great deal more than factual information. Some of these other func- 
tions performed by language are usually mastered later by foreigners and give 
rise to misinterpretation, sometimes making foreign speakers appear insensi- 
tive when they are simply deploying fewer resources in the language. 



6.1 Model-Based Speech Synthesis 

Most synthetic speech is “manufactured” by speech synthesizers such as lin- 
ear predictive coders (LPC), formant vocoders or “terminal analogs” of the 
vocal tract. These synthesizers may exist either as hardware or, more com- 
monly, as software. The low-level parameters (predictor coefficients, formant 
frequencies, samples of the area functions) that control these synthesizers 
are computed from a few high-level parameters (such as tongue position, 
lip rounding, etc.). These parameters are obtained from articulatory mod- 
els that incorporate the physical and linguistic constraints of human speech 
production. 

Needless to say, the algorithms necessary for these conversions are not 
exactly simple. A vast body of research has been devoted to the study of the 
human speaking process, including high-speed motion pictures of the human 
vocal cords, x-ray movies of the articulators, electrical contacts on the palate, 
hot-wire flow meters in front of the lips or nose, magnetic field probes to track 
the motions of various articulators (adorned with minuscule magnets), and 
myographic recordings from the muscles that activate the articulators. In 
addition, neural networks have been trained to speak in an attempt to learn 
more about human speaking [6.4]. 

One of the several areas in which still more research is required is the 
functioning of the vocal folds [6.5]. Future high-quality speech synthesizers 
may also have to forgo the fiction that vocal cords and vocal tract are
completely decoupled mechanical systems. There is no dearth of research 
topics in speech synthesis! In fact, the quality of synthetic speech (or the lack 
thereof) is one of the severest tests of our linguistic knowledge. 



6.2 Synthesis by Concatenation 

One of the most seductive methods of synthesizing speech from text is by 
stringing together, or concatenating, prerecorded words, syllables, or other
speech segments [6.6]. This avoids many of the problems encountered in phoneme-
by-phoneme synthesis, such as the coarticulatory effects between neighboring
speech sounds [6.7]. Still, even words do not usually occur in isolation: the 
words immediately preceding or following a given word influence its
articulation, its pitch, its duration and stress - often depending on the meaning
of the utterance. Thus, a (single) lighthouse keeper may advertise for a light 
housekeeper. And it is icecream I scream for. - You just can’t get away from 
meaning in speech, be it synthesis, recognition, and, perforce, translation. 

Another problem of word concatenation is the large dictionary required
for general-purpose texts.*

* I once gave a talk in Philadelphia and had the computer deliver the
introduction by text-to-speech synthesis using word concatenation. I wanted
the machine to say “I just arrived from New Jersey,” but, alas, the word Jersey
wasn’t in the dictionary. What to do? Well, Philadelphia isn’t Brooklyn, but,
as I had hoped, Joy-See was readily understood.

The size-of-the-dictionary problem is of course alleviated if one
concatenates syllables rather than whole words. But then coarticulation effects
become more complex again. To minimize the more difficult coarticulation
effects, it is best to base the dictionary on consonant-vowel-consonant (CVC)
strings and to cut these strings in the center of the steady-state vowel,
yielding demisyllables [6.8]. Another way to divide and conquer syllables is
to use diphones (vowel to postvocalic consonant transitions).

For many languages, demisyllables minimize the coarticulation effects at
syllable boundaries because the demisyllables are obtained from natural
utterances by “cutting” in the middle of a steady-state vowel. Thus only
relatively simple concatenation rules might be required - in the best of all
worlds. But the reality of human speech is more complex and a successful
concatenation system may have to rely on a combination of demisyllables,
diphones, and suffixes (postvocalic consonant clusters).



6.3 Prosody

I don’t want to talk grammar, I want to talk like a lady. 

(Liza Doolittle in Shaw’s Pygmalion)



For some time now, text-to-speech (TTS) systems have produced intelligible, 
if unpleasant sounding, speech. Much synthetic speech still has an unnatu- 
ral (“electronic”) accent and the fault lies largely at the door of prosody: 
voice pitch, segment durations, loudness fluctuations and other aspects of 
speech that go beyond the sequence of phonemes of the utterance. It has 
been shown that proper prosody is also crucial to ease of understanding. 
For example, subjects who have to perform a “competing” task do so more 
reliably while listening to high-quality speech and they tire later compared 
to subjects listening to speech with improper prosody [6.9]. And as is well 
known, improper prosody can render a foreign speaker difficult - sometimes 
impossible - to understand. 

Prosody is also heavily dependent on the gender of the speaker. And there 
is more to the gender difference than pitch height. As mentioned before, 
B.S. Atal and the writer once tried to change a male into a female voice by 
just raising the fundamental frequency. The resulting “hermaphrodite” was 
a linguistic calamity. Even changing the formant frequencies and bandwidths 
in accordance with female vocal tract physiology did not help much: the voice 
of the gynandroid never sounded very attractive. 

But there is considerable commercial interest in changing voices and ac- 
cents, not only from male to female (and vice versa), but from, say, “Deep 
South” to Oxford English (and vice versa?). 

In spite of persisting difficulties, considerable progress toward more human 
sounding, intelligible speech has been made during the last several decades. 
In his inaugural lecture at the University of Göttingen in 1970, the writer
demonstrated the then current standard of TTS by playing a German poem 
by Heinrich Heine^, synthesized on an American computer (slightly modified 
for the occasion by Noriko Umeda and Cecil Coker). No one in the select 
audience seemed to understand more than a few words. Then the wily lecturer 
played the same tape once more, this time around with a simultaneous (but 
unannounced) slide projection of the text. Suddenly everybody understood. 
But most listeners were not aware why they understood the second playing, 
namely by reading the text. Thus (30 years ago anyhow) providing visual cues 
(preferably the complete text) was a great help in rendering TTS intelligible. 

The importance of prosody is nicely illustrated by the following observation.
Like most people, I do not recognize my own voice when listening to it over a
tape recorder. However, even when speaking English (my second language), I
notice the German accent and can tell that the speaker grew up in the
Münsterland (don’t drop the Umlaut!) where I lived a long, long time ago and
where I haven’t lived since I was 10. In other words, in speaking English, my
accent is not a generic German accent but that of a specific region near the
Dutch border. And accent means primarily prosody. It’s the native prosody that
is so hard to disguise in speaking a foreign language. (I have little trouble
with the English speech sounds. In fact, some Münsterland words are closer to
English than to High German - like water, which is Water in Münsterland and not
the High German Wasser.)

^Mit Deinen blauen Augen
siehst Du mich lieblich an.
Da wird mir so träumend zu Sinne,
daß ich nicht sprechen kann.




7. Speech Production 



Speech is of Time, Silence is of Eternity. 

Thomas Carlyle (1795-1881) 



The right word may be effective, but no word was ever as effective as a rightly 
timed pause. 

Mark Twain (1835-1910) 



Compared to something sound and simple, such as getting a playground swing 
to oscillate, speech production is a mess. Not just the subtle thought that 
should precede the speech act, even the purely physical and physiological 
processes of speech production are difficult to comprehend. Everything, but 
the teeth, that is involved in forming the sounds of speech - the glottis, the 
tongue, the soft palate, the lips - is soft, both in biological consistency and 
algorithmic determinacy of articulatory features. The production of a human 
utterance, from mental intent to emitted sound wave, is utterly complex - a 
truly elaborate production. 

One of the greatest marvels of evolution and human life is how children 
ever learn such a complicated task as speaking, seemingly without great effort. 
But then, how do they learn sucking and, a bit later, walking? The answer 
is: by an innate “preprogrammed” ability that we call instinct, in our case 
the language instinct enticingly analyzed by Steven Pinker [7.1]. 

In spite of these complexities, linguists and engineers, following well es- 
tablished scientific precedent, have succeeded in developing models of speech 
production that, though often grossly simplified, have made the process com- 
prehensible and amenable to mathematical analysis. Rational analysis in turn 
has led to sustained progress in such practical endeavors as speech recogni- 
tion, speech synthesis and speech compression (for more efficient storage and 
transmission and for enhanced digital scrambling).

7.1 Sources and Filters



The first step in order to inject rational sense into an irrational situation 
is to “divide and conquer”. And this is exactly what speech scientists have 
done in trying to better understand the speech production process. They have 
divided the sources of acoustic energy (quasiperiodic puffs of air or turbulent 
air flow) from the acoustic resonators (the vocal and nasal tracts) that shape 
the power spectrum of the source [7.2]. Modern studies of the motions of the
vocal cords have shown how much they are influenced by the geometry of the
articulators downstream: the tongue, the soft palate, and, last but not least,
the lips [7.3]. It is therefore not a little surprising that the production of
vowel sounds can be described, quite adequately for many purposes, by a
simple two-part model that comprises an energy source, representing the
vocal cords, for generating quasiperiodic pulses and a filter with multiple
resonances emulating the action of the vocal tract.



7.2 The Vocal Source 

Figure 7.1 shows a typical glottal waveform, i.e. the volume velocity, measured 
in milliliters^ per second, of air discharged from the vocal cords into the 
vocal tract. The waveform is characterized by a relatively slow rise during 
the opening phase of the vocal cords and a more sudden descent during the 
cord’s closing [7.4]. This basic “wavelet” is repeated, with some fluctuation 
in its shape, more or less periodically with period lengths ranging from 6 to 
20 milliseconds (ms) for most male voices, from 4 to 7 ms for female speech 
and even shorter periods for high-pitched children and screaming adults. The 
corresponding fundamental frequency ranges are 50-167 Hz for males with 
typical values clustering around 110 Hz, and 140-250 Hz for females with a 
peak near 200 Hz. 
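
The period and frequency ranges quoted above are tied together by the reciprocal
relation between period and fundamental frequency. As a minimal sketch (the
function name is ours, chosen only for illustration), the conversion can be
checked as follows:

```python
def fundamental_frequency_hz(period_ms: float) -> float:
    """Convert a glottal period in milliseconds to a fundamental frequency in Hz."""
    if period_ms <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / period_ms


# Period ranges quoted in the text, in milliseconds.
# The longest period gives the lowest frequency and vice versa.
male_range_hz = (fundamental_frequency_hz(20.0), fundamental_frequency_hz(6.0))
female_range_hz = (fundamental_frequency_hz(7.0), fundamental_frequency_hz(4.0))

print("male:   %.0f to %.0f Hz" % male_range_hz)
print("female: %.0f to %.0f Hz" % female_range_hz)
```

The male range reproduces the 50-167 Hz quoted in the text; the female period
range of 4 to 7 ms corresponds to roughly 143-250 Hz, in line with the rounded
140-250 Hz given above.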

The pitch of singing voices covers of course a wider range: for a bass singer 
the range extends from below 65 Hz (C2 in musical notation) up to 330 Hz 
(E4); for a tenor from 120 to 500 Hz; for an alto between 170 and 700 Hz; and 
for a soprano from 250 to 1300 Hz [7.5]. 

Basically, the oscillations of the vocal cords are produced by the air pres- 
sure from the lungs, which forces the cords to open to let the air escape, and 
the negative Bernoulli pressure, which pulls the cords together again once 
the air starts flowing at high speed. This mechanism is similar to the one 
employed in the Bronx cheer and the (nonwhispered) flatus. 

^One milliliter, i.e. one thousandth of a liter, has the same volume as one cubic
centimeter (cm$^3$). Throughout this book, following international usage, I will
use modern units with integer powers of one thousand ($10^3$), expressed as
milli-, micro-, kilo-, etc.

Fig. 7.1. Schematic representation of airflow through the vocal cords for normal
vowel phonation. The vocal cords (“folds”) typically open relatively slowly, in a
manner that can be modeled by half a cosine wave. After the maximum airflow is
reached, they close quite rapidly with the airflow decreasing like a quarter
sine wave

The short-time Fourier spectrum of the glottal wave, owing to the
quasiperiodicity of the waveform, consists of harmonically related lines at
multiples of the fundamental frequency $f_0 = 1/T_0$, where $T_0$ is the length
of the fundamental period. The spectral amplitudes show an overall drop-off
with frequency, reflecting the fact that the glottal pulses are not sharp spikes
but rounded waveforms. For normal speech effort the drop-off rate is about
12 decibels per octave (dB/octave). (A difference of 12 dB corresponds to
a factor of 4 in sound pressure amplitude and a factor of 16 in sound energy
density.) This sharp decline of high-frequency content is partly offset by
a better radiation of high frequencies from the human lips. The lips, being
more like high-frequency tweeters in size than low-frequency woofers, produce
a high-frequency emphasis of about 6 dB/octave.
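
The decibel figures in this paragraph can be verified with a few lines of
Python. This is only a sketch; the helper names and the two slope constants are
introduced here for illustration and simply restate the numbers quoted above.

```python
def amplitude_ratio(level_db: float) -> float:
    """Sound pressure (amplitude) ratio corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 20.0)


def energy_ratio(level_db: float) -> float:
    """Energy (or power) ratio corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 10.0)


# Slopes quoted in the text, in dB per octave.
GLOTTAL_SLOPE_DB_PER_OCTAVE = -12.0   # drop-off of the glottal spectrum
LIP_RADIATION_DB_PER_OCTAVE = +6.0    # high-frequency emphasis of the lips

# Slopes in dB add, so the radiated voiced spectrum falls by the sum.
net_slope_db_per_octave = GLOTTAL_SLOPE_DB_PER_OCTAVE + LIP_RADIATION_DB_PER_OCTAVE

print(amplitude_ratio(12.0), energy_ratio(12.0), net_slope_db_per_octave)
```

A 12 dB difference thus comes out as a factor of very nearly 4 in amplitude and
16 in energy, and the combined source-plus-radiation slope is about
-6 dB/octave.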

By increasing the air pressure in the lungs from 0.5 kilopascal (kPa) for 
normal speech to 1 or 1.5 kPa, the amplitude of the glottal waveform and 
consequently the speech loudness can be substantially increased^. At the 
same time the glottal waveform becomes “sharper” (less rounded) thereby 
further increasing the high-frequency content of the speech signal that bears 
much of its linguistic information. 

Singers singing fortissimo can push their lung pressure to 5 or even 10 kPa
while, it is hoped, maintaining perfect pitch control [7.6]. But even such high- 
pressure singing does not always fill a voluminous concert hall with sufficient 
voice volume - something that pop singers have discovered long ago when 
they lunged for the first microphones. 

The glottal waveform can be measured directly by an air-flow meter consisting
of a small length of an electrically heated wire cooled by the air flow
around it. The resulting drop in electrical resistance is a measure of the flow
velocity. Multiplication of the flow velocity by the open cross-section area of
the vocal cords (assuming uniform flow) gives the volume velocity.

^The modern unit of pressure, one Pascal, is defined as one Newton per square
meter; it equals ten times the old dyne per square centimeter (dyn/cm$^2$).

The shape of the glottal waveform can also be obtained indirectly by
a method called inverse filtering [7.7]. In inverse filtering, a microphone is
placed near the lips of the speaker and the recorded acoustic signal is
subjected to an electrical filtering process whose transfer function is the
reciprocal of the transfer function between the vocal cords and the microphone.
This transfer function is adjusted interactively, while a tape loop of the
recorded speech signal is played back, until a reasonable looking, smooth
waveform is obtained. This suspiciously circular sounding “bootstrap” method is
actually quite useful in speech research.^

The motion of the vocal cords has been studied by high-speed films and
modeled as spring-coupled masses. Dynamical systems with two or more de- 
grees of freedom are subject to period doubling and chaotic motion [7.8]. 
Such irregular motions of the vocal cords have indeed been observed and are 
now tackled by modern chaos theory [7.9]. 

Puffs of air provide the energy for the quasi-periodic voiced speech sounds,
such as the vowels and the voiced consonants. The aperiodic unvoiced speech
sounds, especially the fricatives like /s/, /sh/ and /f/ but also the whispered
vowels and whispered speech in general, receive the energy from turbulent
air flow in the vocal tract. Such turbulence is generated particularly near
narrow constrictions of the tract, say between the tip of the tongue and the
front teeth, as for the /th/-sound in “teeth”. Since the energy source for
these sounds is located somewhere between the glottis and the lips (and not
at the glottal end of the tract), the vocal tract transmission function is not
that of a minimum-phase all-pole filter (see Chap. 8 for more detail on these
concepts). Instead, the transmission function has spectral zeros
(“antiresonances” or spectral minima) in addition to poles (resonances).
However, these antiresonances are not very important perceptually, because they
are often inaudible due to auditory masking by the surrounding resonances (see
Chap. 7). (For obvious reasons evolution has seen to it that we respond well to
concentrations of energy in the spectrum because they, and not the absence of
spectral energy, signal approaching danger or sources of live food.) Turbulent
airflow also plays a role, albeit a less important one, in the production of
vowels and other voiced speech sounds.

^This bootstrap situation is not uncommon in scientific endeavors. As the great 
Danish physicist Niels Bohr once remarked to Werner Heisenberg of Uncertainty 
fame while washing dishes in a mountain hut during a skiing vacation: “The dish 
water is dirty, the rags we use are dirty but in the end, after enough rubbing, we 
get sparkling clean glasses. It’s not much different with our scientific theories: we 
start with dirty data, apply wrong reasoning but (with a little bit of luck) end up 
with reasonable theories.” 

Bohr is also famous for his quip when asked whether he was superstitious when a 
visitor saw a horseshoe nailed to his door. “No,” Bohr answered, “but I understand 
that it helps even if you don’t believe in it.”

7.3 The Vocal Tract

For the speech scientist the vocal tract, originally designed by evolution 
for breathing, eating, and drinking before it was coopted for speaking, is 
an acoustic resonator with variable geometry under control of the (sober) 
speaker [7.10]. The geometrical configuration of the vocal tract determines 
its resonances which in turn imprint their spectral pattern on the spectrum 
of the glottal pulses. For nasalized sounds (/m/ and /n/ and some vowels, in 
French and other languages) the spectrum is further modified by the coupling 
of the vocal tract to the nasal cavity [7.11]. The resulting spectrum is then 
radiated from the lip and nose openings to impinge on the listeners’ ears and, 
ultimately, their minds. 

From a perceptual point of view, the first two formants are the most
important constituents of a speech signal. In fact, for most speech sounds,
the third formant has a relatively constant frequency, usually above 2 kHz.
(However, in English, for the retroflex /r/ sound, as in “bird,” the third
formant does dip below 2 kHz for adult speakers.) It therefore makes eminent
sense to characterize the vowels by the frequencies, $f_1$ and $f_2$, of their
first two formants, usually portrayed in the $f_1$/$f_2$ plane, see Fig. 7.2 and
the pivotal paper by Peterson and Barney [7.12].

[Figure: horizontal axis F1 in cycles per second]

Fig. 7.2. Frequency of the second formant (F2) versus that of the first formant
(F1) for five different vowels spoken by adult male speakers. The solid points are
averages of the Peterson and Barney data. The central point represents the neutral
vowel /ə/ or “schwa” sound

During continuous speech, the steady-state vowel frequencies are seldom
reached, especially in unstressed syllables. In English, Russian and many
other languages, there is a general tendency to “neutralize” the vowels,
meaning that the formant frequencies will migrate to the frequencies of the
“schwa” sound /ə/, also called the neutral vowel, whose formant frequencies are
those of the uniform-area vocal tract (about 500 Hz, 1500 Hz, 2500 Hz etc.
for the adult male). In other words, the “vowel” triangle has a tendency to
shrink. This phenomenon is also called vowel reduction [7.13]. In Russian, for
example, the unstressed /oh/ sound turns into a short /ah/ sound. The
tendencies for neutralization increase with increasing speed of speaking. It is
as if the tongue were too lazy to follow its prescribed contortions and
preferred to stay closer to a “middle” position. But of course, this vowel
reduction is mostly a matter of physical constraints and economy in the
speaking process.

The same “least-effort” principle also leads to the ubiquitous phenomenon 
of “coarticulation” in running speech. This means that the vocal tract shape 
of a given vowel blends into the shapes of the preceding and following speech 
sound [7.14]. 

Coarticulation is also very important for speech synthesis [7.15, 16]. The
author, for his inaugural lecture at the University of Göttingen, once
demonstrated the importance of proper coarticulation in speech synthesis by
synthesizing the Latin name of the university (Georgia Augusta, after its
founder George II of England and Elector of Hanover) by splicing together
magnetic tape snippets with the sounds G-E-O-R-G-I-A A-U-G-U-S-T-A and,
predictably, nobody in the audience understood. This little demonstration
also shows how important it is to pay proper attention to coarticulation, and
other articulatory constraints, in speech synthesis. These constraints reflect
the geometry and the masses of the articulators and the finite muscular forces
acting on them. For speech recognition, too, building the proper constraints
into the speech production model can improve correct recognition rates
substantially.

Another peculiarity of many languages, not least English and American 
English, is a tendency to eschew some pure vowels and turn them into diph- 
thongs [7.13]. This tendency persists when native-English speakers speak a 
foreign language, such as Italian or Hungarian (perish the thought), lan- 
guages that abhor English-style diphthongs. (Conversely, native Hungarians 
often betray their linguistic home by pronouncing English diphthongs as pure 
vowels, such as crying “rehp” instead of “rape” when the need arises - as hap- 
pened to a friend who found his house “devastated” after returning from a 
long absence). 

7.3.1 Radiation from the Lips 

The acoustic radiation from the lips acts much like what engineers call a
first-order highpass filter. In a rough approximation, the radiation from the
lips with opening area $A$ can be likened to that of a little loudspeaker
(“tweeter”) or, more formally, to an oscillating piston with area $A$ in an
infinitely large baffle. Such an oscillating piston in turn can be well
approximated by an oscillating sphere with a surface area equal to $A$, see
Appendix A. The sound field of a spherical wave of angular frequency $\omega$
emitted by such a sphere is given by the velocity potential

$$\phi = \frac{1}{r}\, e^{i(\omega t - kr)} , \tag{7.1}$$

where $r$ is the distance from the center of the sphere, $t$ is time, and $k$ is
the magnitude of the wavevector; $k = \omega/c = 2\pi/\lambda$ where $c$ is the
velocity of sound and $\lambda$ is the wavelength [7.17].

Differentiation of $\phi$ with respect to time and multiplying by the specific
density of air $\rho$ yields the sound pressure and differentiation with respect
to the position coordinate and multiplying by $-1$ gives the particle velocity
of the air. The ratio of the pressure to the particle velocity is called the
characteristic impedance; for a spherical wave it is given by

$$Z(r) = \rho c\, \frac{ikr}{1 + ikr} . \tag{7.2}$$

The radiation efficiency $\eta$ is defined as the ratio of the real part of
$Z(r)$ at the surface of the sphere ($r = R$) divided by the characteristic
impedance, $\rho c$, of the surrounding medium:

$$\eta = \frac{k^2 R^2}{1 + k^2 R^2} . \tag{7.3}$$



The sound radiation of a sphere can be represented by a first-order highpass
filter. Its cutoff angular frequency $\omega_c$ corresponds to $kR = 1$ and thus
equals $c/R$. Introducing the surface area $A = 4\pi R^2$ of the sphere, we
obtain for $f_c := \omega_c/2\pi$

$$f_c = \frac{c}{\sqrt{\pi A}} . \tag{7.4}$$

In terms of the corresponding cutoff wavelength $\lambda_c = c/f_c$, equation
(7.4) takes on the simple form $\lambda_c = 1.8\sqrt{A}$, which is independent of
$c$ and valid for arbitrary units. (If the mouth area $A$ is given in square feet
(foot in mouth?), then the calculated cutoff wavelength is also in feet.) Thus,
the larger the sphere or piston, the lower the cutoff frequency - a well-known
result that has governed loudspeaker design for decades. For a lip opening with
an area of, say, $5 \cdot 10^{-4}$ m$^2$ (formerly 5 square centimeters) the
cutoff frequency is 8600 Hz. Hence, sound radiation from the lips is essentially
a highpass affair with a slope of +6 dB/octave over most of the speech spectrum.
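
The numbers in this subsection can be reproduced with a short Python sketch of
the oscillating-sphere model described above. It assumes a sphere whose surface
area equals the lip opening, a radiation efficiency of the form
k²R²/(1 + k²R²), and a cutoff at kR = 1. The function names and the sound
velocity of 343 m/s are our own assumptions, not values taken from the text.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at room temperature


def sphere_radius_m(area_m2: float) -> float:
    """Radius of a sphere whose surface area equals the given opening area."""
    return math.sqrt(area_m2 / (4.0 * math.pi))


def radiation_efficiency(freq_hz: float, area_m2: float,
                         c: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """First-order highpass characteristic k^2 R^2 / (1 + k^2 R^2)."""
    k = 2.0 * math.pi * freq_hz / c
    kr = k * sphere_radius_m(area_m2)
    return kr * kr / (1.0 + kr * kr)


def cutoff_frequency_hz(area_m2: float,
                        c: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Cutoff frequency from kR = 1, i.e. f_c = c / sqrt(pi * A)."""
    return c / math.sqrt(math.pi * area_m2)


lip_area_m2 = 5e-4  # 5 square centimeters, as in the text
f_c = cutoff_frequency_hz(lip_area_m2)
lambda_c = SPEED_OF_SOUND_M_PER_S / f_c

print("cutoff frequency: %.0f Hz" % f_c)
print("cutoff wavelength / sqrt(A): %.2f" % (lambda_c / math.sqrt(lip_area_m2)))
print("efficiency at 1 kHz: %.4f" % radiation_efficiency(1000.0, lip_area_m2))
```

With these assumptions the cutoff comes out between 8.6 and 8.7 kHz, the
efficiency at the cutoff is exactly one half, and the ratio of cutoff wavelength
to the square root of the area is the square root of pi, about 1.77. Well below
the cutoff the efficiency grows with the square of the frequency, which is the
+6 dB/octave slope in amplitude.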

The total radiated power $P$ is proportional to $\eta$ times the mouth opening
area $A$. Thus, with (7.3), for frequencies below the cutoff,

$$P \sim A k^2 R^2 \sim k^2 A^2 . \tag{7.5}$$

Hence, for equal vocal effort, the “big mouths” have it. [Whoever said “keep
your mouth shut”? Did they know about the factor $A^2$ in (7.5)?]

It is interesting to note that the lower frequency components of a speech
signal are also radiated, albeit attenuated, through the speaker’s cheeks,
especially if these are not too thick. More precisely, the cheeks act somewhat
like a sound-leaking single-layer wall between two apartments, i.e. like a
first-order lowpass filter with a cutoff angular frequency $\omega_c$ equal to
$2\rho c/md$. Here $\rho$ is the specific density of air, $c$ the sound velocity
in air, $m$ the specific gravity of the cheeks and $d$ their thickness. With
$\rho c = 414$ kg m$^{-2}$ s$^{-1}$, $m = 10^3$ kg m$^{-3}$ (like water) and
$d = 0.005$ m, the cutoff frequency $f_c = \omega_c/2\pi$ equals 26 Hz. Thus,
the cheek impedance is a significant factor in the first-formant frequency
range (150-900 Hz for adult speakers). In fact it is the controlling factor for
the first formant when the lips are closed, as in the closed-lip epoch of the
voiced plosive sound /b/. While for heavy metal cheeks the first formant
frequency would drop to near zero, the finite acoustic impedance of cheeks made
of flesh limits the drop of the first formant to a finite frequency. This is a
significant effect in the production and the perception of plosive sounds.
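
The 26 Hz figure follows directly from the values quoted in this paragraph. As
a small sketch (the function and parameter names are ours), taking the cutoff
angular frequency to be twice the characteristic impedance of air divided by
the mass per unit area of the cheek wall:

```python
import math


def cheek_cutoff_hz(air_impedance: float = 414.0,    # rho * c, in kg m^-2 s^-1
                    cheek_density: float = 1000.0,   # kg m^-3, like water
                    cheek_thickness: float = 0.005   # m
                    ) -> float:
    """Cutoff frequency of the cheeks modeled as a first-order lowpass wall."""
    mass_per_area = cheek_density * cheek_thickness      # kg m^-2
    omega_c = 2.0 * air_impedance / mass_per_area        # rad/s
    return omega_c / (2.0 * math.pi)


print("cheek cutoff: %.1f Hz" % cheek_cutoff_hz())
```

Doubling the assumed cheek thickness halves the cutoff frequency, since only
the mass per unit area enters the formula.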



7.4 The Acoustic Tube Model of the Vocal Tract 

The shape of the vocal tract is anything but straight and simple. But
fortunately for lower frequencies it is its cross-sectional area that matters
most, not its actual shape and curvature. The vocal tract can therefore be
modeled by a straight acoustic tube of variable cross-section $A(x)$, where
$x = 0$ corresponds to the input, i.e. the glottis end of the tract, and $x = L$
(typically 0.17 m in an adult) corresponds to the output, i.e. the lip end of
the tract, see Appendix A. For an optimum straightening strategy, the
intermediate values of $x$ are obtained from the flow lines of a laminar flow
through the tract [7.18].

The sound transmission of a tube of variable cross-section $A(x)$ can be
described by Webster’s Horn Equation for the sound pressure $p(x,t)$ along
its length:

$$\frac{\partial^2 p}{\partial x^2} + \frac{1}{A}\frac{dA}{dx}\frac{\partial p}{\partial x} = \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2} , \tag{7.6}$$

which for constant area ($dA/dx = 0$) reverts to the customary wave equation
in a one-dimensional lossless medium [7.19].

One important requirement for (7.6) to be valid is that the tract support
no cross-modes. This means, roughly, that the largest cross dimension of the
tract be smaller than half the shortest wavelength to be considered. As a
consequence, sound transmission in a tract as wide as 40 millimeters (mm)
is not properly described by (7.6) for wavelengths smaller than 80 mm or
frequencies above about 4.3 kHz.

Several methods exist for solving the Horn equation [7.20]. For smoothly
varying cross-section $A(x)$, a perturbation method borrowed from quantum
mechanics, the Wentzel-Kramers-Brillouin (WKB) method, is particularly
simple to use. The smoothness condition is that the relative change of area
within one wavelength be smaller than 1:

$$\left| \frac{1}{A}\frac{dA}{dx} \right| \lambda < 1 . \tag{7.7}$$



For an exponentially flaring horn, $A(x) \sim \exp(\epsilon x)$, where
$\epsilon$ is the flare constant, solution of the horn equation (7.6) shows that
the horn acts like a highpass filter with a cutoff frequency $f_c$ equal to
$c\epsilon/4\pi$. Thus, horns with a large flare have a high cutoff frequency
which makes them ineffective in woofer design. For a good woofer with
$f_c = 30$ Hz, we have $\epsilon \approx 1.1$ m$^{-1}$. This means that the area
$A(x)$ can only increase by a factor $\exp(1.1) \approx 3$ over the length of
one meter. For a tenfold increase in area, typical for large woofers, the
length of the horn must therefore exceed 2 m. To fit such a long horn into
a box of reasonable size, the horn must be folded, as in fact it is in most
modern woofers.
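
The woofer arithmetic above is easy to retrace. The following sketch assumes
the cutoff relation for an exponential horn stated in the paragraph (cutoff
frequency equal to the sound velocity times the flare constant divided by
4 pi); the function names and the sound velocity of 343 m/s are ours.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value


def flare_constant_per_m(cutoff_hz: float,
                         c: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Flare constant of an exponential horn with the given cutoff frequency."""
    return 4.0 * math.pi * cutoff_hz / c


def horn_length_m(area_ratio: float, flare_per_m: float) -> float:
    """Length over which the area exp(flare * x) grows by the given ratio."""
    return math.log(area_ratio) / flare_per_m


epsilon = flare_constant_per_m(30.0)
growth_per_meter = math.exp(epsilon)
length_for_tenfold = horn_length_m(10.0, epsilon)

print("flare constant: %.2f per meter" % epsilon)
print("area growth over 1 m: %.2f" % growth_per_meter)
print("length for tenfold area: %.2f m" % length_for_tenfold)
```

With these assumptions the flare constant is about 1.1 per meter, the area
roughly triples per meter, and a tenfold area increase needs a little more
than 2 m of horn.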

Fig. 7.3. A cosinusoidal deformation of the vocal-tract area function that changes
only the second formant frequency appreciably

The connection between geometry and resonance frequencies is particularly
simple for small deviations from a tract with a uniform area function.
For a uniform tube with an effective length of 172 mm closed at one end and
open at the other, the resonances are odd multiples of 500 Hz, namely 500 Hz,
1500 Hz, 2500 Hz etc. For small sinusoidal perturbations of the logarithmic
area function, symmetric about the tract’s midpoint, all formant frequencies
will stay fixed to first order. This is another case of articulatory ambiguity.
A small perturbation corresponding to a half-period cosine will affect only the
first formant frequency. Similarly, a 3 half-period cosine perturbation will
affect only the second formant frequency, see Fig. 7.3; and a 5 half-period
cosine will change only the third formant frequency [7.21]. By contrast,
perturbations corresponding to full-period cosines do not change the formant
frequencies to first order. More generally, changes in the area function that
are symmetric about the midpoint of the vocal tract (open at the lips) have
no effect on the formant frequencies. This is a case of articulatory ambiguity
- widely exploited by ventriloquists, who can produce understandable speech
without moving their lips. Thus, calculating area functions from acoustic
information (formant frequencies) is difficult, see Fig. 7.4, unless other
constraints are marshalled to resolve the ambiguity.
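
The resonances of the uniform tube quoted above are those of a
quarter-wavelength resonator, closed at the glottis and open at the lips. A
minimal sketch, assuming a sound velocity of 344 m/s (our choice, picked so that
a 172 mm tube gives round numbers) and a function name of our own:

```python
def uniform_tube_resonances_hz(length_m: float = 0.172,
                               count: int = 3,
                               c: float = 344.0) -> list:
    """Resonances of a uniform tube closed at one end and open at the other.

    The n-th resonance is the odd multiple (2n - 1) of c / (4 L).
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


formants = uniform_tube_resonances_hz()
print([round(f) for f in formants])
```

A shorter tract scales all resonances up in proportion, which is one reason
why formant frequencies of children and many female speakers lie higher than
those of adult males.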

As mentioned before, one method to overcome the articulatory ambiguity 
is to measure the impedance at the lips; see Fig. 2.7. The vocal tract area 
functions for the utterance /iba/ determined by lip impedance measurements 
are shown in Fig. 7.5. 

The exact change in formant frequency $\delta f$ can be calculated from the work
$\delta W$ done against the acoustic radiation pressure as the tract area
function is deformed by $\delta A(x)$:

$$\frac{\delta f}{f} = \frac{\delta W}{W} , \tag{7.8}$$

where $W$ is the total energy in the normal mode considered. Equation (7.8)
embodies the adiabatic principle,^ so called because it is valid for “adiabatic”
changes, i.e. changes (of the area function) slow compared to the resonance
frequency.

[Figure: vocal tract area functions for vowel [u]]

Fig. 7.4. A: area functions derived from x-ray data for the vowel /u/ (G. Fant). B:
Smooth area function calculated from the first three formants of the vowel. Note the
large differences especially near the glottis. However, the linguistically important
constrictions of the vocal tract around 7 cm from the glottis and at the lips are well
represented by the formant data

Fig. 7.5. Vocal tract area functions for the utterance /iba/ calculated from
measurements of the lip impedance. Note the movements of the “tongue” in the
transition from /i/ to /a/ and the properly found closure at the lips during the
labial plosive /b/

For large deviations from a uniform area function, its computation from 
the tract transfer function has been tackled by the mathematical concept of 
fiber bundles [7.22]. 

A detailed mathematical treatment of vocal tract acoustics and modeling 
of the tract is given in Appendix A by H. W. Strube. 

In the following section we shall introduce the reader to the concept of con- 
volution, in both its analog and digital form, in connection with the source- 
filter model of speech production. 



^The adiabatic principle is one of the most elegant and widely valid principles 
in all of physics. It was developed in 1913-16 by the Austrian-Dutch physicist 
Paul Ehrenfest (1880-1933) who proved that, in every periodic system under slow 
(“adiabatic”) changes in its parameters, the ratio of the energy to the resonant 
frequency was invariant. 

Ehrenfest, well known for his statistical urn models, received his Ph.D. in sta- 
tistical mechanics under Ludwig Boltzmann in Vienna and later became one of 
Einstein’s closest friends - in physics, music (violin), and political sympathies.

7.5 Discrete Time Description



For the source-filter model of speech production, the sound pressure wave
$p(t)$ emanating from the lips can be described by

$$p(t) = s(t) \star h(t) , \tag{7.9}$$

where $s(t)$ is the source signal, $h(t)$ is the impulse response of the vocal
tract filter, including lip radiation, and $\star$ stands for a convolution
integral:

$$p(t) = \int_0^\infty s(t - t')\, h(t')\, dt' .$$

Fourier transforming (7.9) results in

$$\hat{p}(\omega) = \hat{s}(\omega) \cdot \hat{h}(\omega) , \tag{7.10}$$

which shows that the Fourier transform of the speech signal, $\hat{p}(\omega)$,
is obtained by multiplying the source transform, $\hat{s}(\omega)$, by the
vocal-tract transfer function, $\hat{h}(\omega)$.

For discrete-time (“sampled”) signals, the proper analytic tool is the
z-transform [7.23]. Instead of an impulse response as a function of time, $h(t)$,
we consider only discrete samples $h[n] := h(nT)$, where $T$ is the sampling
interval.^ The impulse response is then represented by its z-transform

$$H(z) := \sum_n h[n]\, z^{-n} . \tag{7.11}$$

Mathematically speaking, the z-transform is a generating function in which
the customary variable $x$ has been replaced by $z^{-1}$. Thus, to know the $n$th
sample, $h[n]$, of the impulse response, we have to look at the coefficient of
$z^{-n}$. For causal impulse responses, $h[n] = 0$ for $n < 0$.

^In this book a change in variable is signalled by going from parentheses ( ) to
brackets [ ]. Fourier transforms are indicated by a circumflex (^) or a capital
letter.

By setting $z = \exp(i\omega T)$, one obtains the transfer function of the
sampled impulse response

$$H[\omega] = \sum_n h[n]\, e^{-i\omega nT} . \tag{7.12}$$

Note that the transfer function of a sampled response is periodic in the
frequency variable $\omega$ with a period length of $2\pi/T$. $H[\omega]$ is
therefore only considered in a finite frequency interval, usually

$$-\frac{\pi}{T} < \omega \le \frac{\pi}{T} . \tag{7.13}$$

In terms of z-transforms, equation (7.10) appears as follows

$$P(z) = S(z) \cdot H(z) , \tag{7.14}$$

which can be interpreted both in the frequency domain

$$P[\omega] = S[\omega] \cdot H[\omega] \tag{7.15}$$

and in the discrete time domain, as can be seen when we sample both the
source signal $s(t)$ and the speech signal $p(t)$ and introduce their
z-transforms

$$S(z) := \sum_n s[n]\, z^{-n} , \qquad P(z) := \sum_n p[n]\, z^{-n} . \tag{7.16}$$

Plugging these definitions into (7.14) and looking at the coefficients of
$z^{-n}$, we see that $p[n]$ is given by a discrete convolution:

$$p[n] = \sum_k s[k] \cdot h[n - k] . \tag{7.17}$$

Equation (7.17) replaces the continuous-time convolution (7.9). Thus, the
z-transform gives both frequency or spectral information and time-domain
information; it is the proper tool for discrete-time systems.
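
The equivalence between multiplying z-transforms and convolving sample
sequences can be checked numerically. The sketch below uses only the Python
standard library; the two short sequences are made-up illustrative values, not
data from the text, and the function names are ours.

```python
import cmath


def convolve(s, h):
    """Discrete convolution p[n] = sum over k of s[k] * h[n - k] (causal, finite)."""
    p = [0.0] * (len(s) + len(h) - 1)
    for k, s_k in enumerate(s):
        for m, h_m in enumerate(h):
            p[k + m] += s_k * h_m   # term contributes to sample n = k + m
    return p


def z_transform(x, z):
    """Evaluate X(z) = sum over n of x[n] * z**(-n) for a finite causal sequence."""
    return sum(x_n * z ** (-n) for n, x_n in enumerate(x))


source = [1.0, 0.5, 0.25]          # hypothetical source samples s[n]
impulse_response = [1.0, -0.9]     # hypothetical filter samples h[n]
speech = convolve(source, impulse_response)

# On the unit circle, z = exp(i * omega * T), the z-transform becomes the
# transfer function; here omega * T = 0.7 rad is an arbitrary test point.
z = cmath.exp(1j * 0.7)
lhs = z_transform(speech, z)
rhs = z_transform(source, z) * z_transform(impulse_response, z)
print(speech)
print(abs(lhs - rhs))
```

The difference between the two sides is at the level of floating-point
rounding, for this and for any other point z at which the sums are evaluated.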




8. The Speech Signal 



Take care of the sense and the sounds will take care of themselves. 

Lewis Carroll (1832-1898) 



The speech signal, as it emerges from a speaker’s mouth, nose and cheeks, is 
a one-dimensional function (air pressure) of time. Microphones convert the 
fluctuating air pressure into electrical signals, voltages or currents, in which 
form we usually deal with speech signals in speech processing. Analog-to- 
Digital converters change the analog voltages into binary (or n-ary) digital 
signals. Bandlimited speech signals (bandlimited by a telephone system, for 
example) of less than 4000 Hz bandwidth can be represented, according to 
the sampling theorem, by 8000 samples per second. Each sample can be 
quantized to 256 levels (8 bits) with little audible degradation if the levels 
properly cover the voltage range of the signal. (One or two bits per sample 
can be saved by a judicious, non-uniform choice of levels at the cost of only 
minor audible distortion.) Thus the total information rate required for a high- 
quality representation of a speech signal bandlimited to 4 kHz is 8 bits/sample 
times 8000 samples/second or 64 kbits per second. (For comparison, the bit 
rate on a stereo compact disc (CD) exceeds 1.4 Mbits/second.) The aim of 
speech compression is to reduce this bit rate as much as possible for more 
efficient storage and transmission. 
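
The bit-rate arithmetic above can be restated as a small sketch. The telephone figures (8000 samples per second, 8 bits) come from the text; the compact-disc parameters (44 100 samples per second, 16 bits, two channels) are the standard values, assumed here because the text quotes only the resulting rate.

```python
def bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate in bits per second for uniformly sampled audio."""
    return sample_rate_hz * bits_per_sample * channels


levels = 2 ** 8                            # 8 bits give 256 quantizing levels
telephone = bit_rate(8000, 8)              # 4-kHz bandlimited speech
compact_disc = bit_rate(44100, 16, channels=2)   # assumed standard CD format

print(levels, telephone, compact_disc)
```

With these assumptions the telephone rate comes to 64 000 bits per second and the stereo CD rate to 1 411 200 bits per second, a ratio of about 22.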

Although the speech signal is a one-dimensional function (air pressure) 
of a one-dimensional variable (time), it is generated by a plethora of parallel 
nerve commands from the brain, controlling the muscles of the various organs 
participating in the articulatory process - vocal cords, tongue body, tongue 
tip, lips, soft palate (velum), etc. These nerve commands do not only occur in 
parallel, they are noticeably desynchronized to compensate for different de- 
lays on different nerve fibers and to promote the numerous “coarticulatory” 
effects observed in speech in which the articulation of one speech sound is 
substantially influenced by its neighbors. These phenomena are documented 
in great detail for one widely understood language in the book Acoustics of 
American English Speech: A Dynamic Approach by J. P. Olive, A. Greenwood,
and J. Coleman [8.1].

In spite of the great complexity of the speech production process - from 
thought and intent in the brain to the acoustic signal - a more simplistic 
view of speech signals, disregarding most of these complexities, suffices for 
simple speech signal compression. But some of these complexities cannot 
be safely ignored in speech recognition and particularly in speech synthesis 
from written material. Not paying proper attention to the human production 
process is precisely the reason why machine speech, to this day, has a flavor 
of, well, machine speech. Exorcising the “electronic accent” from synthetic 
speech is a continuing challenge. 



8.1 Spectral Envelope and Fine Structure 

Among the many distinctive features of speech, one that even speech compression
cannot ignore is the dichotomy between voiced and unvoiced sounds. For
voiced sounds, like the vowels and voiced consonants in non-whispered speech,
the vocal cords vibrate more or less periodically, chopping up the air stream
from the lungs into individual puffs of air at a fundamental frequency $f_0$
ranging from roughly 50 Hz (low male) to 300 Hz (high female). The resulting
“quasiperiodicity” of the speech signal is manifest both in the waveform and
the short-time spectrum: the waveform shows a repetitive pattern at the rate
$f_0$ and the spectrum has equidistant peaks (“lines”) at integer multiples of
the fundamental frequency $f_0$ (“harmonics”), see Fig. 4.2.
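
The line structure of a periodic signal can be illustrated with an idealized excitation. The sketch below assumes a train of unit pulses at a fundamental of 100 Hz, sampled at 8000 Hz; real glottal pulses are neither impulses nor perfectly periodic, so this is only a caricature of the voiced source.

```python
import numpy as np

fs = 8000            # sampling rate in Hz
f0 = 100             # assumed fundamental frequency in Hz
period = fs // f0    # 80 samples between successive pulses

x = np.zeros(fs)     # one second of signal, so FFT bins are 1 Hz apart
x[::period] = 1.0    # idealized "puffs of air": one unit pulse per period

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# Frequencies at which the spectrum is not (numerically) zero
line_freqs = freqs[spectrum > 1e-6]
print(line_freqs[:6])
```

All spectral energy sits at integer multiples of the fundamental (plus a constant component at 0 Hz, because the pulses are all positive). In real speech the lines are broadened by the finite analysis window and by the slow variation of the pitch.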



8.2 Unvoiced Sounds 

During unvoiced sounds, the vocal cords do not vibrate and they do not 
undulate the air flow from the lungs. (But they may be nearly closed causing 
audible friction as for the /h/ sound.) The acoustic energy is produced by 
turbulence at one or several narrow air passages in the mouth (tongue tip 
against or between the teeth, tongue against the palate etc.). This turbulent 
energy has a “smooth” spectrum like that of noise, without a line structure, 
see Fig. 4.3. 



8.3 The Voiced-Unvoiced Classification 

But not all speech sounds are either purely voiced or turbulent. The voiced
fricatives /v/ as in veal, /z/ as in zeal, and /ʒ/ as in pleasure are voiced,
because the vocal cords vibrate, and they are turbulent because of noise gen-
erated at narrow constrictions in the vocal tract. The speech signal therefore
contains a periodically modulated noise, a fact ignored by speech synthesizers
that distinguish only between voiced and unvoiced sounds.

The spectral envelope is defined as the smooth outline drawn over the 
relative maxima of the short-time spectrum. It is clear that this is not a 
precise definition. A more quantitative definition can be based on the low 
quefrencies of the cepstrum, see Chap. 10. In other words, considering the 
logarithmic spectrum as a signal and smoothing it with an ideal lowpass filter 
would give the spectral envelope, as shown in Figs. 4.2,3. 
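
The cepstral definition of the envelope can be sketched as follows. The frame is synthetic: a pulse train with an assumed pitch period of 80 samples, passed through a single made-up damped resonance at 700 Hz. The cutoff quefrency of 30 samples is likewise an illustrative choice; it only needs to stay below the pitch period.

```python
import numpy as np

fs = 8000        # sampling rate in Hz
N = 1024         # frame length in samples
period = 80      # assumed pitch period in samples (f0 = 100 Hz)

# Synthetic "voiced" frame: pulse train passed through one damped resonance
excitation = np.zeros(N)
excitation[::period] = 1.0
n = np.arange(200)
resonance = np.exp(-n / 30.0) * np.cos(2 * np.pi * 700 * n / fs)  # hypothetical formant
frame = np.convolve(excitation, resonance)[:N] * np.hanning(N)

# Logarithmic spectrum, treated as a "signal", and its cepstrum
log_spectrum = np.log(np.abs(np.fft.fft(frame)) + 1e-9)
cepstrum = np.fft.ifft(log_spectrum).real

# Ideal lowpass "lifter": keep only the low quefrencies (and their mirror images)
cutoff = 30
lifter = np.zeros(N)
lifter[:cutoff] = 1.0
lifter[-(cutoff - 1):] = 1.0
envelope = np.fft.fft(cepstrum * lifter).real

# The fine structure shows up as a cepstral peak near the pitch period
pitch_quefrency = 40 + np.argmax(cepstrum[40:200])
print(pitch_quefrency)
```

The array `envelope` is the smoothed logarithmic spectrum; the harmonic ripple removed by the lifter reappears as the cepstral peak near the pitch period.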

The distinction between spectral envelope and fine structure is not the
same as that between the transfer (“filter”) function of the vocal tract and the
vocal source signal. The latter has its own spectral envelope that is included
in the overall spectral envelope of the speech signal. Most speech compression
systems separate the spectrum into an overall spectral envelope and the re-
maining time structure (if any) with a flat spectral envelope. Thus, vocoder
channel signals and linear prediction coefficients, for example, usually include
aspects of the excitation function - another fact that is easily overlooked in
uni(n)formed analyses of such systems.



8.4 The Formant Frequencies 

The most prominent features of vowel spectra are the peaks in the
spectral envelope caused by the resonances of the vocal tract. Following
musicology, these vowel resonances are called formants because they shape
(“form”) the spectrum. The spectrum determines the sound quality or timbre
that we hear. The positioning and even more so the dynamics of the formant
frequencies define the syllables and words that we perceive as speech. The
first three formant frequencies fall below 3000 Hz for adult speakers. The third
formant frequency falls near 2600 Hz for most vowels, except the retroflex /r/
sound. Thus, vowels can be largely distinguished on the basis of their first
two formant frequencies.

The formant frequencies in turn are determined by the geometry of the
vocal tract as illustrated in Fig. 2.5. Depending on the height of the tongue
body, phoneticians distinguish between high, mid, and low vowels. Similarly,
the different positions of the tongue body in the forward/backward direction
lead to the distinction between front, central, and back vowels. For example,
/i:/ as in bee is a high front vowel, while /ɑ:/ as in father is a low back vowel.

Vowels with extreme front/back positions are also called tense. Thus /i:/
as in bee is tense, but /ɪ/ as in bid is lax (not tense). The laxest of all vow-
els is the so-called schwa sound or neutral vowel /ə/ as in about, a sound
that abounds in English. The vocal tract for the schwa sound has a nearly
constant cross-sectional area along its length. It is the vowel that requires
the least muscular effort to produce - the laziest vowel, so to speak. It is
interesting to ask whether this makes English harder to understand, by ma-
chines and even humans, than some other languages, like Italian, that have a
smaller tendency to neutralize their vowels in unstressed positions. (Russian
is another great neutralizer.)
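
A tract of constant cross-section lends itself to a simple worked example. The sketch below assumes the textbook idealization of a uniform, lossless tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The tube length of 17 cm and the sound speed of 340 m/s are assumed round values, not figures taken from this section.

```python
def uniform_tube_resonances(length_m, n_resonances=3, speed_of_sound=340.0):
    """Resonances (Hz) of an idealized tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 L), k = 1, 2, 3, ...
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_resonances + 1)]


# Assumed vocal-tract length of 17 cm (a typical adult male value)
formants = uniform_tube_resonances(0.17)
print([round(f) for f in formants])
```

Under these assumptions the neutral vowel has formants near 500, 1500, and 2500 Hz, evenly spaced and consistent with the statement above that the first three formants of adult speakers lie below 3000 Hz.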

English vowel phonemes are often described in terms of binary (+ or -)
distinctive features, which include the feature round for the degree of lip
rounding [8.2,3]. Thus, /i/ has the distinctive features + high, - low, - back,
- round, and + tense. The feature vector for /i/ is (+ - - - +) and that for
/ɑ/ is (- + + - +).

English, including American English, is also rich in diphthongs, boasting a
total of five such combinations of vowels with the glides /j/ and /w/: /aj/ as
in bite, /ej/ as in bait, /oj/ as in boy, /aw/ as in bout, and /ow/ as in boat.
Diphthongs occur more frequently in English than in, say, French. In fact, 
one of the difficulties of proper English pronunciation for a French speaker 
is to know how to mispronounce certain French words. Thus, in English, the 
French valet invariably comes out as /valej/ with a diphthong at the end. 
Conversely, native speakers of French (and Hungarian) have a tendency to 
substitute pure vowels for English diphthongs. 

In Italian of course all vowels (and consonants, including double conso- 
nants) are pronounced separately. Thus, Europe, which in English is barely 
two syllables, comes out in Italian as Eh-oo-roh-pa. Would that all languages 
were pronounced like that - it would make speech recognition a lot simpler! 




9. Hearing^ 



Language is the dress of thought. 



Samuel Johnson (1709-1784) 



One of the best hearing aids a man can have is an attentive wife. 

Groucho Marx (1895-1977) 



Hearing is one of the senses that evolution has bestowed on us to better 
survive in a complex and not always friendly environment. In the process, 
“natural selection” has wrought some real marvels. At the threshold of hear- 
ing people can hear sounds just above the thermal limit of molecular Brow- 
nian motion in the inner ear. At the other end of the loudness scale, our ears 
can cope with intensities a thousand billion times greater than the threshold 
value. 

The protective function of our ears is enhanced by the fact that sound 
travels around corners - not just in straight lines as light rays do. Thus 
we are warned even of invisible dangers. And wisely, nature did not supply 
us with “ear-lids” to shut ourselves off from the sounds of approaching dis- 
aster. In contrast to our eyes, our ears are always on guard. Our hearing 
analyzes sounds with respect to two important dimensions: frequency and 
direction [9.1]. The frequency analysis in the inner ear, a marvel of selec- 
tivity and sensitivity, allows us to detect weak spectral prominences in the 
presence of strong broadband noises. As a special bonus, so to speak, spec- 
tral analysis enables us to perceive spectrally coded signals, such as music, 
song, and above all speech. One of the foremost features of human hearing 
is masking [9.2]. Auditory masking means that one (loud) sound makes an- 
other (soft) sound inaudible. Masking is most effective at frequencies near 
the frequencies contained in the masker. In addition, there is upward spread
of masking, meaning that frequencies above the masker frequencies are also
rendered inaudible or reduced in loudness.

^Adapted in part from Proc. IEEE 63, 1332-1350 (1975).

This upward spread of masking results directly from the anatomy of the 
inner ear and the traveling waves on the basilar membrane: the low fre- 
quencies in a signal traverse the places along the basilar membrane in the 
inner ear where higher frequencies are detected. Thus, strong low frequencies 
can “swamp” high frequencies. By contrast, high frequencies are strongly at- 
tenuated beyond their place of detection and have therefore relatively little 
masking effect on the lower frequencies. The basilar membrane is in effect a 
nonuniform transmission line with progressively lower cut-off frequency. 

Upward spread of masking is the principal cause for the loss of speech 
intelligibility for most older people. The fact that their threshold of hearing 
at 2000 Hz, compared to 500 Hz, is elevated by perhaps as much as 60 dB 
(a million-fold drop in sensitivity to sound intensity) is an affliction that
they could live with in many situations, for example when listening to mu- 
sic.^ But in the presence of interfering sounds, especially those containing 
low frequency components, below 500 Hz say, masking and especially upward 
spread of masking will drown out many intelligible speech sounds that have 
their most important spectral components between 500 Hz and 2500 Hz. 
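
The intensity figures quoted in this chapter follow directly from the definition of the decibel as ten times the base-10 logarithm of an intensity ratio. A minimal sketch of the conversion:

```python
import math


def intensity_ratio_to_db(ratio):
    """Level difference in dB corresponding to an intensity (power) ratio."""
    return 10.0 * math.log10(ratio)


def db_to_intensity_ratio(level_db):
    """Intensity (power) ratio corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 10.0)


threshold_shift = db_to_intensity_ratio(60.0)    # the 60-dB elevation mentioned above
dynamic_range = intensity_ratio_to_db(1e12)      # "a thousand billion times"

print(threshold_shift, dynamic_range)
```

A 60-dB threshold elevation thus corresponds to a factor of one million in intensity, and the ear's intensity range of a thousand billion to one corresponds to 120 dB.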

This predicament is further aggravated by the fact that in many loca-
tions, a crowded restaurant for example, the interfering noises are other peo-
ple’s voices whose low-frequency components are little absorbed by draperies,
rugs and even “acoustic ceilings.” (As mentioned before, the most effective 
countermeasures to copious clamor include small tables with large distances 
between tables - and, by the way, guzzling less disinhibiting liquids by the 
clientele.) 

By contrast, for the speech-synthesizing scientist, masking is a good thing. 
It means that he can “hide” the inevitable quantizing noise under the speech 
spectrum by tailoring it so as to maximally exploit auditory masking [9.3]. 
High-quality speech signals have thus been synthesized from bit streams of 
less than 1 bit per sample [9.4]. 

People have thought about their ears (and those of many animals) for 
a long time. Intelligent speculation, supplemented by experiment (and vice 
versa), has brought us a long way toward understanding how the ear works 
- from subtle monaural phase effects (verboten by Ohm’s law of acoustics)
to expansive binaural stereophony (welcomed by almost everyone). Much of 
this newfound knowledge, some of it acquired only very recently, has found 
its formal expression - as it has in other fields - in models: mathematical 
models or physical models (or both) of how the ear “does it” - or might do 
it if it had been designed by fanciful model builders instead of by pragmatic 
evolution. 



^Auditory researchers are at a loss to explain why such a sharp loss in high- 
frequency hearing does not play havoc with musical enjoyment - perhaps older 
people get used to the spectral distortion as it builds up gradually over the years. 

This chapter gives a brief introduction to the human ear - not with claims
of exhaustiveness, but with the intent to give a flavor of what is going on in
a very active field concerning a fascinating, and ultimately still mysterious,
subject: Homo sapiens’ sense of hearing.



9.1 Historical Antecedents 

An understanding of our sense of hearing has been a goal of human sci- 
ence and speculation since the very beginning of civilization. As early as the 
first century BC, the Roman poet and philosopher Lucretius postulated little 
grains of sand in the inner ear responding to different tones. Lucretius’ theory 
constituted what we would call today a “particle theory” of sound, anticipat- 
ing in a most amusing manner the phonons of modern quantum acoustics. In 
his book [9.5] he held forth on speech and hearing as follows. 

“In the first place, all forms of sound and vocal utterance become 
audible when they have slipped into the ear and provoked sensation by 
the impact of their own bodies. The fact that voices and other sounds 
can impinge on the senses is itself a proof of their corporeal nature. 
Besides, the voice often scrapes the throat and a shout roughens the 
windpipe on its outward path. What happens is that, when atoms 
of voice in greater numbers than usual have begun to squeeze out 
through the narrow outlet, the doorway of the overcrowded mouth 
gets scraped. . . 

“Again, you must have noticed how much it takes out of a man, 
and what wear and tear it causes to his thews and sinews, to keep on 
talking from the first glow of dawn till the evening shadows darken, 
especially if his words are uttered at the pitch of his voice. Since much 
talking actually takes something out of the body, it follows that voice 
is composed of bodily stuff. 

“When we force out these utterances from the depths of our body 
and launch them through the direct outlet of the mouth, they are cut 
up into lengths by the flexible tongue, the craftsman of words, and 
moulded in turn by the configuration of the lips. 

“It often happens that a single word, uttered from the mouth 
of a crier, penetrates the ears of a whole crowd. Evidently, a single 
utterance must split up immediately into a multitude of utterances, 
since it is parcelled out amongst a number of separate ears, imprinting 
upon each the shape of a word and its distinctive sound. Some of these 
utterances as do not strike upon the ears float by and are scattered 
to the winds and lost without effect. Some of them, however, bump 
against solid objects and bounce back, so as to carry back a sound 
and sometimes mislead with the replica of a word. . . 

“I have observed places tossing back six or seven utterances when
you have launched a single one: with their tendency to rebound, the 
words were reverberated and reiterated from hill to hill. According 
to local legend, these places are haunted by goat-footed Satyrs and 
by Nymphs. Tales are told of Fauns, whose noisy revels and merry 
pranks shatter the mute hush of night for miles around. . . 

“There remains the problem, not a very puzzling one, of how 
sounds can penetrate and strike on the ear through media through 
which objects cannot be clearly perceived by the eye. The obvious 
reason why we often hear a conversation going on through closed 
doors is that an utterance can make its way intact through circuitous 
fissures in objects impervious to visual films. For these are broken 
up, unless they are passing through straight fissures such as those 
in glass, which is penetrable by any sort of image. Again, sounds 
are disseminated in all directions because each one, after its initial 
splintering into great many parts, gives birth to others, just as a spark 
of fire often propagates itself by starting fires of its own. So places 
out of the direct path are often filled with voices, which surge round 
every obstacle, one sound being provoked by another.” 

As fantastic as some of Lucretius’ ideas may sound to the modern ear, we 
can recognize in his writings concepts which today, 2000 years later, form the 
very basis of our understanding of sound: energy, reverberation, diffraction, 
and even Huyghens’ Principle (here likened to the spread of fire).

Nevertheless, many centuries were to pass before more quantitative obser- 
vations about sound and hearing emerged. In the 18th century, Tartini [9.6], 
the noted Italian violinist, described his “terzi suoni” - “third tones” that 
the ear itself manufactures from two tones played simultaneously. These tones 
occupy a very important place in our attempts to understand the workings 
of the inner ear and particularly its nonlinear behavior. Contrary to long 
held views, the ear is not a highly linear receiver, even at very low sound 
intensities. Rather, “combination tones”, such as those observed by Tartini, 
become in fact audible near the very threshold of hearing where the mechan-
ical motions in the inner ear are fractions of Angstroms ($10^{-7}$ mm) or the
diameter of the hydrogen atom! The origin of these nonlinearities is still not 
completely clear, but the fact that they occur at atomic dimensions strongly 
implicates molecular processes - molecular processes, moreover, that appear 
to be intimately connected with the metabolism in the inner ear because the 
nonlinear phenomena observed in animals change when the blood supply to 
the ear is interrupted. 

Besides Tartini, numerous other composers and performers of music con- 
tributed observations (or speculations) on our sense of hearing. But sustained 
research on the ear did not begin until the middle of the 19th century. 



9.2 Thomas Seebeck and Georg Simon Ohm

Contemporary research into hearing had its inception with the work of 
Thomas Seebeck (1770-1831), Georg Simon Ohm (1789-1854), and Hermann 
von Helmholtz (1821-1894). 

Seebeck discovered what is now known as “periodicity pitch,” i.e. the 
sensation of a pitch-like sound quality without the presence of a physical 
component in the acoustic stimulus at the perceived frequency [9.7]. Seebeck’s
observation, together with subsequent work by Jan Schouten [9.8, 9] and his
Dutch school on “residue pitch” (so named because a sensation of pitch remains
even after the removal of the corresponding frequency component in the signal),
was one of the great discoveries in hearing.
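
A stimulus of the kind used to demonstrate residue pitch is easy to construct. The sketch below assumes a complex of harmonics 3, 4, and 5 of a 200-Hz fundamental, so that the spectrum contains no component at 200 Hz. As a stand-in for the ear's periodicity analysis it uses the autocorrelation function; this is an illustrative assumption, not a model endorsed by the text.

```python
import numpy as np

fs = 8000                   # sampling rate in Hz
f0 = 200                    # assumed "missing" fundamental in Hz
t = np.arange(fs) / fs      # one second, so FFT bins are 1 Hz apart

# Harmonics 3, 4 and 5 only: no energy at f0 itself
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in (3, 4, 5))

amplitude = np.abs(np.fft.rfft(x)) / (len(x) / 2)   # amplitude per 1-Hz bin

# Circular autocorrelation via the power spectrum (exact for this periodic signal)
acf = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real

# Search lags corresponding to periodicities between 150 Hz and 500 Hz
lo, hi = fs // 500, fs // 150
best_lag = lo + int(np.argmax(acf[lo:hi + 1]))
periodicity_hz = fs / best_lag

print(amplitude[f0], periodicity_hz)
```

The spectrum is empty at 200 Hz, yet the waveform repeats every 5 ms, and the autocorrelation peak reports a periodicity of 200 Hz: the frequency at which listeners hear the pitch.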

Ohm postulated his acoustic “phase law” which said that the perceived 
quality of a sound depended solely on its power spectrum and was inde- 
pendent of the phase angles of its frequency components. This phase law, 
although shown by modern research to admit exceptions, is one of the funda- 
mental facts of “psychoacoustics” as the study of hearing by listening tests 
has come to be called. Together with physiological studies and mathemati- 
cal modeling (now called “computational hearing”), psychoacoustic experi- 
ments, using elaborate acoustic signals, play a central role in modern hearing 
research. Well-designed psychoacoustic experiments allow us to penetrate, as 
it were, through the ear to the very centers of consciousness in our brains. 
And although Ohm and his contemporaries did not have the sophisticated
equipment modern researchers enjoy (particularly digital computers and sig-
nal processors) for the generation of precisely tailored sounds, Ohm is clearly
the father (or perhaps grandfather) of the psychoacoustic approach to the
ear.



9.3 More on Monaural Phase Sensitivity 

One of the most fascinating problems in hearing - one that has puzzled psy- 
choacousticians, hi-fi fans and laypersons alike - is the ability, or “inability,” 
of the human ear to perceive “phase.” As already mentioned, as long ago
as the middle of the last century, Ohm formulated his famous Acoustic Law
which states that aural perception depends only on the amplitude spectrum 
of a sound and is independent of the phase angles of the various frequency 
components contained in its spectrum. 

In order to avoid obvious violations of Ohm’s Acoustic Law, we are forced 
to make its language more precise in several respects. Thus we have to add the 
modifier “short-time” before “amplitude spectrum” in the above formulation. 
Otherwise, a counterexample to the phase law could easily be constructed as 
follows: 

1. Take a 100-second long segment of a speech signal;

2. calculate its discrete Fourier transform (assuming 100-s periodic repeti-
tion). This yields a frequency component every 1/100 Hz;

3. randomize the phase angles, i.e. choose each phase angle independently
from a uniform distribution between 0 and 2π (0° and 360°);

4. calculate the inverse Fourier transform.

The result is a signal which, for all practical purposes, during intervals 
shorter than 100 s, looks and sounds like a Gaussian noise (with an energy 
spectrum equal to that of the speech signal). Thus, manipulating the phase 
angles in this case has not only altered the acoustic quality but completely 
changed the signal from intelligible speech to random noise. 
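
The four steps above can be sketched in a few lines. Since no recording is available here, the input is a made-up two-second stand-in (an amplitude-modulated tone complex) rather than 100 s of speech; the seed and all signal parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility only
fs = 8000
t = np.arange(2 * fs) / fs

# Step 1: stand-in for a speech segment (hypothetical test signal)
x = (1 + np.sin(2 * np.pi * 3 * t)) * (
    np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t))

# Step 2: discrete Fourier transform of the whole segment
X = np.fft.rfft(x)

# Step 3: replace every phase angle by one drawn uniformly from [0, 2*pi)
phases = rng.uniform(0.0, 2 * np.pi, size=X.shape)
phases[0] = 0.0     # the zero-frequency bin must stay real
phases[-1] = 0.0    # so must the Nyquist bin (the length is even)
Y = np.abs(X) * np.exp(1j * phases)

# Step 4: inverse transform
y = np.fft.irfft(Y, n=len(x))

same_magnitude = np.allclose(np.abs(np.fft.rfft(y)), np.abs(X), atol=1e-6)
same_waveform = np.allclose(x, y)
print(same_magnitude, same_waveform)
```

The long-time amplitude spectrum is unchanged, but the waveform, and with it the temporal structure on which intelligibility depends, is not. Applying the same operation frame by frame, over intervals of roughly 50 ms, would correspond to the short-time case discussed in the text.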

If instead of taking the Fourier transform over 100 s, we had performed 
the phase operation on a Fourier transform over time intervals corresponding 
to the speech-analysis time of the ear, say 50 ms, the acoustical quality would
have remained intact - at least on informal listening. 

Thus in speaking of the “phase deafness” of the ear, we must remember 
that we are talking about short-time spectra. 



9.4 Hermann von Helmholtz and Georg von Bekesy 

Helmholtz was interested not only in the sensations of tone; he also did con- 
siderable physiological work on the anatomy of the ear. Based on his obser- 
vations, Helmholtz propounded a resonance theory of the inner ear which, 
although no longer acceptable in a literal sense, caught the essence of the 
inner-ear mechanics: frequency selectivity [9.10]. 

In our own century, Georg von Bekesy (1899-1972) [9.11] discovered the 
all-important traveling waves on the basilar membrane (BM) and, as a for- 
mer telephone engineer, correctly recognized the medium of these traveling 
waves as a nonuniform transmission line: high frequencies travel only a short 
distance on the BM and are then rapidly attenuated, while low frequencies 
travel farther along the BM - the lower the farther - before being stopped. 
This lowpass behavior, together with a local resonance, leads to the observed 
frequency selectivity of the BM motion, seen first by Bekesy himself and more 
recently by Johnstone, Rhode [9.12], Kohlloffel, Wilson, and Helfenstein using 
the most advanced tools that physics has to offer such as the ultrasensitive 
Mossbauer-Doppler effect, laser interferometry, and capacitive microphone 
probing. 

9.4.1 Thresholds of Hearing 

Bekesy, who was awarded the 1961 Nobel Prize for his work on hearing, is 
also remembered for his audiometer, now widely used in auditory research 
and clinical audiometry. In Bekesy audiometry, thresholds of hearing are 
measured by an “up-down” tracking method: the subject keeps a button
pressed as long as he or she hears the signal whose amplitude is progressively 
reduced in small steps. Once the signal becomes inaudible, the subject releases 
the button whereupon the signal amplitude is increased in small steps - until 
the subject hears the signal again and starts pressing the button anew. H. 
Levitt has considerably extended Bekesy’s up-down methods to yield more 
accurate results [9.13]. 
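
The up-down tracking procedure can be simulated with an idealized listener. In the sketch below the listener is assumed to hear the tone exactly when its level is at or above a fixed threshold, with no internal noise and no reaction time. The function name, the step size, and the rule of averaging the reversal levels are illustrative choices, not the details of any particular audiometer.

```python
def bekesy_track(true_threshold_db, start_db=40.0, step_db=2.0, n_reversals=8):
    """Simulate up-down tracking for a hypothetical noise-free listener.

    Returns the mean of the reversal levels and the list of reversal levels.
    """
    level = start_db
    pressing = level >= true_threshold_db    # button state at the start
    reversals = []
    while len(reversals) < n_reversals:
        # The level falls while the button is pressed and rises while it is released
        level += -step_db if pressing else step_db
        hears = level >= true_threshold_db
        if hears != pressing:                # the subject changes the button state
            reversals.append(level)
            pressing = hears
    return sum(reversals) / len(reversals), reversals


estimate, reversal_levels = bekesy_track(true_threshold_db=13.0)
print(estimate, reversal_levels)
```

For this idealized listener the level oscillates between the two step values that bracket the threshold, and the mean of the reversal levels lies within half a step of the true value. A real listener responds probabilistically, which is one reason refined up-down rules have been developed.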

In addition to measuring absolute thresholds of hearing as a function of 
frequency (for selecting the proper hearing aid, for example), the Bekesy 
method is also used for determining masked thresholds, for instance of a tone 
“buried” in noise. 

9.4.2 Pulsation Threshold and Continuity Effect 

In addition to absolute and masked thresholds, another kind of threshold has 
gained prominence since the 1970s: the pulsation threshold introduced by T. 
Houtgast to study lateral inhibition in hearing (in analogy to “Mach bands” in 
vision) and to find a psychophysical equivalent of neurophysiological “tuning 
curves.” 

In the pulsation threshold paradigm, short (ca. 100 ms) test-tone presen- 
tations alternate with short bursts of noise. For low tone levels, the tone is 
of course completely masked and inaudible. For high enough tone levels, the 
tone is heard to alternate with the noise, as physically presented. By lower- 
ing the tone level progressively in small decrements, a situation is eventually 
reached when the short tone bursts are heard as a continuous tone. This is the 
pulsation threshold: above it the tone percept pulsates, as presented, whereas
just below it the tonal percept is continuous. 

The reason for this remarkable phenomenon is thought to be the continu- 
ity effect, observed in all sensory modalities, not least vision. The continuity 
effect leads to a stimulus being perceived as still present even after it has been 
turned off if the turning off leaves no perceptual trace. (Apparently evolution 
has found it wise to “rig” perception to suggest that a danger that has not 
perceptually gone away is still present.) 

One of the more impressive demonstrations of a pulsation threshold, 
due to G.A. Miller and J.C.R. Licklider, uses a speech signal alternatingly 
turned on and off: 100 ms of speech, 100 ms silence, 100 ms speech etc. Such a 
chopped-up speech signal is of course unintelligible. Naturally, adding a lot of 
noise to the chopped speech does not improve its intelligibility. However, at 
a certain noise level (the pulsation threshold), the speech sounds continuous. 
This continuity effect is so convincing that listeners are sure to be hearing 
uninterrupted speech and would actually understand it “with a little extra 
effort.” Needless to say, they don’t understand a thing. 

For a binaural example of the continuity effect, see Fig. 9.3. 



9.5 Anatomy and Basic Capabilities of the Ear

Figure 9.1 shows a sketch of the outer, middle, and inner ear based on the 
anatomical studies of Helmholtz. Sound waves striking the outer ear are con- 
ducted through the external ear canal to the eardrum at the entrance to the 
middle ear. 

The middle ear contains three small bones (ossicles) which transmit the 
sound vibrations to the oval window at the entrance to the inner ear. 

The inner ear or cochlea (named after its snail-like appearance) is filled 
with fluid and separated by membranes into several ducts.



9.6 The Pinnae and the Outer Ear Canal 

In spite of their relatively simple construction, compared to the complex 
structure and innervations of the inner ear, the pinnae (“auricles”) and outer 
ear canal perform important functions in sound localization. 

It had long been a mystery how people were able to localize sound in the 
“median plane” (an imaginary vertical plane through the head, normal to
the connecting line between the ear drums and equidistant from them). In
the horizontal plane, the ability to distinguish different directions of arrival
of sound waves has been explained by the intensity and phase
differences of the sound waves at the two ears [9.14]. For sound sources in
the median plane (and symmetric head shapes), however, the sound waves
impinging on the left ear and right ear are identical functions of time for all
angles of elevation of the sound source. Yet for a large variety of signals, 
people can easily distinguish between the forward direction (elevation angle 
0°), the overhead direction (90°), the rearward direction (180°), and even 
several intermediate directions [9.15]. 



9.7 The Middle Ear 

The middle ear is perhaps best known to suffering humanity through the 
vicious infections it can contract. The middle ear, however, is also the site 
of more agreeable happenings. Our ears have evolved to a state of perfection 
that seems to leave little room for redundancy. Thus it is not surprising to find 
some important auditory functions being implemented in the space between 
eardrum and oval window (the entrance to the inner ear, see Fig. 9.1).

By far the most important of these is the impedance matching between
the airborne sound in the outer ear and the fluid-borne sound in the inner
ear. The characteristic impedance of air is 414 kg·m$^{-2}$·s$^{-1}$ and that of
water 1 480 000 kg·m$^{-2}$·s$^{-1}$, or almost 3600 times greater. However, the input
impedance of the inner ear, although filled with a water-like liquid, is con-
siderably smaller because the liquid can bulge out the membrane over the
round window. Nevertheless, a considerable amount of impedance transfor-
mation has to be accomplished in going from outer to inner ear. Most of this
transformation is effected by the ratio of the eardrum area to the area of
the stapes footplate. The lever ratio of the ossicle motion also plays a role,
and the combined effect is an impedance transformation of about 1 to 20,
improving power transmission through the middle ear more than fivefold.

Fig. 9.1. Schematic drawing of the middle ear and inner ear. Sound “caught” by
the outer ear is transmitted via the ear canal to the ear drum, where it causes
the ossicles to vibrate and excite the inner ear. In the inner ear (the snail-like
“cochlea”) the sound, now fluid-borne, travels along the basilar membrane where it
is converted by the hair cells to electrical pulses in the acoustic nerve, see Fig. 9.2
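
The size of the mismatch between air and water can be made concrete with a short calculation. The two characteristic impedances are those quoted in the text; the transmission formula is the standard one for a plane wave at normal incidence on a boundary between two media, assumed here as an idealization. It describes the raw air-to-water case only, not the actual input impedance of the inner ear, which the text notes is considerably smaller.

```python
import math


def power_transmission(z1, z2):
    """Fraction of incident power crossing a plane boundary at normal incidence."""
    return 4.0 * z1 * z2 / (z1 + z2) ** 2


Z_AIR = 414.0            # kg m^-2 s^-1, from the text
Z_WATER = 1_480_000.0    # kg m^-2 s^-1, from the text

ratio = Z_WATER / Z_AIR
t_direct = power_transmission(Z_AIR, Z_WATER)
loss_db = -10.0 * math.log10(t_direct)

print(round(ratio), t_direct, loss_db)
```

Under this idealization only about one part in a thousand of the incident sound power would enter the water, a loss of roughly 30 dB, which indicates why an impedance-transforming stage between outer and inner ear is so valuable.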

The other important middle-ear function is a “gain control,” mediated by 
what is known as the acoustic reflex [9.16], which protects the delicate inner 
ear from overloading and possible destruction. 

At very high sound levels, the middle-ear transmission becomes nonlinear
with a predominant quadratic term in its input-output amplitude character-
istic. This nonlinearity produces combination tones if two or more primary
tones are applied simultaneously to the outer ear. Thus if two large-ampli-
tude tones with frequencies $f_1$ and $f_2$ are presented to one ear, a “difference
tone” of frequency $f_2 - f_1$ can be heard. Its amplitude increases in proportion
to the product of the two primary amplitudes - as would be expected for a
quadratic nonlinearity.^



^The difference tone $f_2 - f_1$ can best be demonstrated by slightly varying $f_1$ or
$f_2$ in frequency so that the difference also changes in frequency. More generally, in
a multitone complex, a tone varied in frequency becomes more easily identifiable; 
it perceptually “pops out” from a stationary background. This phenomenon is an 
instance of a pervasive psychophysical fact: Our attention is drawn to changes in a 
stimulus. 
The sum tone with frequency $f_1 + f_2$ can also sometimes be heard but
it is usually much fainter due to masking and the low-pass character of the 
middle-ear frequency transfer function which attenuates frequencies above 
about 1 kHz. 
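
The proportionality between the difference-tone amplitude and the product of the primary amplitudes can be checked numerically. The following is only a minimal sketch: it assumes a memoryless characteristic y = x + eps·x² as a stand-in for the middle ear at high levels, and the function names, the coefficient `eps`, the sampling rate and the tone frequencies are illustrative choices, not values taken from the text.

```python
import math

def tone_amplitude(samples, freq_hz, fs):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT).
    Exact here because every component completes whole cycles in the window."""
    n = len(samples)
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / fs) for i, v in enumerate(samples))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / fs) for i, v in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n

def combination_tones(a1, a2, f1=1000.0, f2=1200.0, eps=0.1, fs=16000.0, n=1600):
    """Amplitudes of the difference tone (f2 - f1) and the sum tone (f1 + f2)
    at the output of the assumed characteristic y = x + eps * x**2."""
    y = []
    for i in range(n):
        t = i / fs
        x = a1 * math.cos(2 * math.pi * f1 * t) + a2 * math.cos(2 * math.pi * f2 * t)
        y.append(x + eps * x * x)   # hypothetical quadratic distortion term
    return tone_amplitude(y, f2 - f1, fs), tone_amplitude(y, f1 + f2, fs)

diff_1, sum_1 = combination_tones(1.0, 1.0)
diff_2, sum_2 = combination_tones(2.0, 1.0)
print(diff_1, diff_2)   # doubling one primary amplitude doubles the difference tone
```

In this idealized model the sum and difference tones come out equally strong (both equal to eps·a1·a2); the sketch deliberately omits the masking and the low-pass middle-ear transfer function that make the sum tone fainter in practice.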

A particularly impressive demonstration of middle-ear distortion can
be produced with two broadband noises obtained by randomly frequency-modulating
two carriers with a fixed frequency difference $\Delta f$. These noises
are applied to the ear via two separate transducers (to circumvent possible
distortion in the transducer). At low sound pressure levels, the linear sum
of the noises is heard. This is just another broadband noise with no audible
periodicities. However, at sufficiently high levels, a tone of frequency $\Delta f$ is
heard. Since this tone is not present in the stimulus, it must have been
“manufactured” by the ear itself. Furthermore, this tone is not just “subjective”
(meaning that it is created in higher nervous centers) but is physically
present in the middle and inner ears. In fact, it can be cancelled by a tone of
frequency $\Delta f$ and of proper amplitude and phase applied to the outer ear.



9.8 The Inner Ear 

The “cochlea” (from the Greek word for a snail with a spiral shell) in the 
inner ear is the frequency-selective part of our hearing organ. A cross section 
through the cochlea can be seen in Fig. 9.2 showing the cochlear duct with 
its three fluid-filled channels separated by membranes. 

One of these membranes, the basilar membrane (BM), supports the organ 
of Corti, the sense organ of hearing. The organ of Corti contains the hair 
cells which convert the relative motion between BM and tectorial membrane 
into nerve impulses. There are two kinds of hair cells: inner and outer hair
cells. Although much was known about the organization of the cochlear
receptor, the raison d’être for these two kinds of cells was long a mystery. It
is now believed that the inner hair cells are the primary receptors converting
mechanical motion of the BM into electrical impulses and that the outer hair
cells feed energy into the BM to compensate for its mechanical losses, thereby
increasing its sensitivity and frequency selectivity.

The first to ascribe frequency-selective properties to the BM was Helmholtz,
who visualized it as a succession of tuned strings (as in a piano) resonant
at different frequencies. However, when Békésy actually looked through his
microscope to observe the vibrations of the BM under acoustic stimulation,
he saw traveling waves, traveling (with decreasing velocity) from the stapes
at the base of the cochlea to the helicotrema at the apex, about 35 mm from
the stapes. Figure 9.3 illustrates such a traveling wave, with a frequency of
200 Hz, at two instants in time separated by 1.25 ms or one-quarter period. 
At the first instant, the deflection of the BM (grossly magnified) is shown 
by the solid line. A quarter-period later, the deflection corresponds to the 
short-dashed line. During this time, the minimum (the negative peak) of the 
Fig. 9.2. Cross-section through the inner ear showing the basilar membrane, along
which sound waves travel, and the hair cells which transduce mechanical vibrations 
into electrical pulses (and vice versa) 




Fig. 9.3. Wave traveling along the basilar membrane. Its amplitude first increases 
slowly and then, beyond its place of resonance, drops rapidly. The phase velocity 
decreases along the entire length of the membrane, as evidenced by the decreasing 
wavelength 
wave has traveled from about 27 mm to 28.5 mm - corresponding to a phase
velocity of 1.2 m/s, or 1/300 of the velocity of sound in air (and an even 
smaller fraction of the sound velocity in water). 
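
The arithmetic behind this figure is simply distance travelled divided by elapsed time. As a small worked sketch (the helper name and the assumed 343 m/s for the speed of sound in air are illustrative additions, not taken from the text):

```python
def phase_velocity(x_start_mm, x_end_mm, frequency_hz, fraction_of_period):
    """Phase velocity in m/s from the distance a wave feature travels
    during a given fraction of one period."""
    distance_m = (x_end_mm - x_start_mm) * 1e-3
    elapsed_s = fraction_of_period / frequency_hz
    return distance_m / elapsed_s

# Negative peak moves from 27 mm to 28.5 mm in a quarter period of a 200-Hz wave.
v = phase_velocity(27.0, 28.5, 200.0, 0.25)
print(v)            # phase velocity in m/s
print(343.0 / v)    # ratio to an assumed 343 m/s for sound in air, roughly 300
```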

Simple inspection of Fig. 9.3 shows that, at larger distances, the phase 
velocity is even lower, whereas nearer to the stapes it is considerably higher. 
But phase (and group) velocities on the BM are not only a function of space 
but depend also on frequency. In the language of the electrical engineer, the 
BM is a nonuniform (dispersive) transmission line. 
Fig. 9.4. Envelope of the waves traveling along the basilar membrane. The lower
the frequency, the farther from the stapes a wave will peak 

The two long-dashed lines in Fig. 9.3 trace the positions of the wave’s 
positive and negative peaks, respectively, as they travel along the BM. This 
wave “envelope” reaches a highest value near 28 mm for a frequency of 200 
Hz. The lower the frequency of the wave, the farther along the BM it travels 
before it is attenuated, as can be seen in Fig. 9.4 which shows wave envelopes 
for 4 different frequencies. Thus, while a wave of 300 Hz travels about 25 mm 
before it is attenuated, a wave of 100 Hz reaches its maximum at about 30 mm. 

Figure 9.5 shows the propagation of short pulses of alternating sign along 
the basilar membrane. At its input end, the waveform is fully preserved. But 
for places on the basilar membrane with increasing distance from the input, 
an increasing degree of lowpass filtering becomes evident until, beyond 28 mm, 
only two and finally just one Fourier component (the fundamental frequency) 
“survives.” 

A simple electrical model of the BM is shown in Fig. 9.6. The inductances 
represent inertia of longitudinal and lateral fluid motions and the mass of the 
BM. The capacitances stand for the elasticity of the BM and the resistances 
represent the losses. The transverse branches of this ladder network are series 
resonance circuits with decreasing resonant frequency from left (input) to
right. For a single frequency, the impedances of the transverse branches can 
be approximated by capacitances to the left of the resonance, a resistance 
at the place of resonance, and inductances beyond. Thus, in the language of 
electrical engineering, the BM acts like a low-loss LC-delay line to the left 
of the resonance place and an inductive attenuator to its right. The energy 
absorbed by the resistance goes to stimulating the hair cells attached to the 
place of resonance for the frequency considered. Although highly simplified, 
this model captures the salient features of BM mechanics. 
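
The single-frequency approximation can be illustrated with the impedance of one transverse branch, Z = R + iωL + 1/(iωC). The following sketch is not the model of Fig. 9.6 itself: it treats each branch in isolation, and the element values, the quality factor and the list of resonance frequencies are hypothetical numbers chosen only to show the sign change of the reactance.

```python
import math

def branch_impedance(freq_hz, resonance_hz, inductance=1.0, q_factor=5.0):
    """Impedance of one series RLC branch, Z = R + j*omega*L + 1/(j*omega*C).
    C is chosen so that the branch resonates at resonance_hz (assumed values)."""
    omega = 2 * math.pi * freq_hz
    omega0 = 2 * math.pi * resonance_hz
    capacitance = 1.0 / (omega0 ** 2 * inductance)
    resistance = omega0 * inductance / q_factor
    reactance = omega * inductance - 1.0 / (omega * capacitance)
    return complex(resistance, reactance)

def classify(z, tol=1e-9):
    """Label a branch by the sign of its reactance."""
    if abs(z.imag) <= tol * abs(z):
        return "resistive"
    return "capacitive" if z.imag < 0 else "inductive"

# Branch resonances decrease from the input (left) to the far end (right).
resonances_hz = [1600.0, 800.0, 400.0, 200.0, 100.0, 50.0]
signal_hz = 200.0
kinds = [classify(branch_impedance(signal_hz, f0)) for f0 in resonances_hz]
print(kinds)
```

For a 200-Hz signal the branches tuned above 200 Hz (to the left of the resonance place) come out capacitive, the branch tuned to 200 Hz is purely resistive, and those beyond it are inductive, in line with the delay-line and attenuator picture described above.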
[Figure 9.5 plot: responses of the basilar membrane versus time (ms)]

Fig. 9.5. Propagation of short pulses of alternating sign along the basilar mem- 
brane, from its input at the stapes (“basal end,” shown at the bottom) to the 
far end (“apex”). As these pulses travel along the basilar membrane, nonlinearly 
growing time delay and an increasing degree of lowpass filtering (smoothing) can 
be observed. At the far end, only one sinusoidal component of the pulse train, its 
fundamental frequency, remains. (Computer simulation by J.L. Flanagan) 

One of the mysteries of the inner ear is its ability to convert the energy of 
incoming sound waves to nerve stimulation with minimum loss and reflection 
along the BM. This is all the more astonishing if one considers that the BM 
has to cope with frequencies from 20 Hz to 20 000 Hz - a range of one to one 
thousand! For this to happen the logarithmic rate of change per wavelength of 
the parameters determining the resonance behavior of the BM must be small 
compared to 1 - otherwise some or most of the incident sound energy will be 
reflected before it reaches its proper place to stimulate the nerves. The fact 
that evolution has “fine-tuned” the basilar membrane to accomplish this feat 
is called the cochlear compromise: large frequency range and little squandered 
energy - and all this in a small “box” that fits inside the human head and
leaves enough room for seeing, feeling, speaking and the rest. 
[Figure 9.6 diagram: electrical analog for the basilar membrane (top) and its simplification for a single frequency (bottom)]




Fig. 9.6. Simplified electrical model of the basilar membrane: inductances represent 
fluid and membrane inertias, capacitances stand for membrane elasticities, and
resistances represent mechanical losses. For a single frequency, the model can be 
further simplified as shown at the bottom: to the left of the resonance place, the 
basilar membrane acts as a delay line and beyond the resonance as an attenuator 



In this manner, each frequency in the audio frequency range has its own 
“place” on the BM where it will cause maximum vibration. This observation 
has led to the so-called “place theories” of pitch perception, according to 
which the position of maximum vibration of the BM determines the pitch of 
a pure tone. 

The inner ear, or cochlea, acts essentially like a bank of overlapping bandpass
filters. The mechanical filtering action is provided by the basilar membrane
in the inner ear. The bandwidths of these filters are called critical
bands of hearing. Below 500 Hz, the critical bandwidth is a constant 100 Hz. 
For higher frequencies the bandwidths are roughly one fifth of the center 
frequencies [9.17]. 

The size of these critical bands is based on the anatomy of the basilar 
membrane, each critical band corresponding to about 1.5 mm in length along 
the membrane, which has a total length of about 36 mm. Thus there are a 
total of 24 critical bands, covering the auditory range of the healthy human 
ear between 20 Hz and 20 000 Hz. The critical bands are usually numbered 
from $z = 1$ Bark (50-150 Hz) to $z = 24$ Bark (12 000-15 500 Hz).^ Since
the relation between frequency $f$ and Bark is linear for small frequencies
and exponential for large frequencies, the author once commandeered the
hyperbolic sine function for an analytic conversion formula between $f$ and
Bark [9.19]. But H. Traunmüller had a better idea; a simple rational function
will do:

$$ z = \frac{26.81\, f}{1960\ \mathrm{Hz} + f} - 0.53 , $$

where the frequency is in Hertz [9.20]. For $f = 1000$ Hz, this formula yields
$z = 8.53$, in close agreement with Zwicker’s value ($z = 8.50$).
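
As a small sketch of how this conversion might be used in practice, the rational formula and the critical-bandwidth rule of thumb quoted earlier can be coded directly. The function names are illustrative; the inverse `bark_to_hz` is not given in the text but follows from solving the formula for the frequency.

```python
def hz_to_bark(f_hz):
    """Critical-band rate z (Bark) from frequency in Hz, rational formula above."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark (derived here, not quoted from the text)."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def critical_bandwidth_hz(f_hz):
    """Rule of thumb from the text: a constant 100 Hz below 500 Hz,
    roughly one fifth of the center frequency above."""
    return 100.0 if f_hz < 500.0 else f_hz / 5.0

z_1k = hz_to_bark(1000.0)
print(round(z_1k, 2))                 # 8.53
print(bark_to_hz(z_1k))               # back to about 1000 Hz
print(critical_bandwidth_hz(1000.0))  # 200.0
```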

The inner ear is uniformly innervated by about 30 000 ascending nerve
fibers connecting the cochlea with the auditory centers in the brain. The me- 
chanical motion of each critical band is therefore “sensed” by some 1200 nerve 
fibers. As one might expect, signals whose frequency content falls within one 
critical band are simply added together. This means that, for incoherent sig- 
nals (such as independent noises), the total power is the sum of the individual 
powers or intensities. 

By contrast, signals whose frequency components do not fall into the same 
critical band are combined at a higher level in the auditory pathway to the 
brain. It is as if the third or fourth roots of the powers are added together 
(and the sum taken to the third or fourth power). As a result, two noises 
of the same power, but falling into different critical bands, have a loudness 
corresponding to an increase in power by a factor of between 8 and 16, rather 
than just a factor of 2 for two noises with overlapping frequency content. 
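
The two addition laws can be put side by side in a few lines. This is only a sketch of the rule as stated above (powers add within a critical band; third or fourth roots of the powers add across bands); the function name and its parameters are illustrative.

```python
def equivalent_power(powers, same_band, exponent=3):
    """Power of a single sound judged equally loud as the combination.
    Within one critical band the powers add; across bands the
    exponent-th roots of the powers add (exponent 3 or 4 in the text)."""
    if same_band:
        return sum(powers)
    return sum(p ** (1.0 / exponent) for p in powers) ** exponent

two_noises = [1.0, 1.0]
print(equivalent_power(two_noises, same_band=True))                # 2.0
print(equivalent_power(two_noises, same_band=False, exponent=3))   # 8.0
print(equivalent_power(two_noises, same_band=False, exponent=4))   # 16.0
```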

The observation that there are two different laws for adding the effects of 
different sounds (to calculate their combined loudness, for example) has in 
fact led to the discovery of the critical bands in the first place. Apart from 
loudness, many other subjective phenomena of hearing are governed by the 
critical bands, including masking. A narrow-band noise masks a pure tone 
within the same critical band much more effectively than outside its critical 
band. 

A serious obstacle for these place theories was the relatively low frequency
resolution (the low “Q”) of the BM as observed by Békésy. Psychoacoustically,
the just-noticeable frequency difference between two tones
presented in succession to a human listener is less than 3 Hz at 1000 Hz! Nevertheless,
undaunted model builders were hardly at a loss to “explain” this
impressive frequency discrimination. They explained it by assuming sophisti- 
cated neural processing following the crude mechanical filtering action of the 



^The designation of the critical bands by “Bark” is unrelated to dogs. It was
coined by Zwicker [9.18] after the German engineer Heinrich Georg Barkhausen
(1881-1956) who invented an early cm-wave oscillator and, in 1911, was appointed
to the first professorship in communications engineering (Schwachstromtechnik).
He proposed subjective measurements of loudness and introduced the loudness unit 
phon. 
BM. We will not trace their intricate (and probably erroneous) reasoning and
simply remark that in more recent work the mechanical filtering of the BM
was shown to be much more frequency-selective than found by Békésy. The
reason why Békésy did not observe this high frequency resolution is that he
worked with cadavers and had to use large sound amplitudes in order to be 
able to see the BM motions under his microscope. It is now known, through 
the work of Rhode and others [9.21], that frequency selectivity is substantially 
lowered at high amplitudes and within minutes after metabolism ceases. 

The great sensitivity and frequency selectivity of our hearing is now be- 
lieved to be the result of active amplification mechanisms in the inner ear. 
Convincing evidence that such mechanisms are at work comes from “Kemp
echoes” and other oto-acoustic emissions [9.22]. In the Kemp echo, named 
after its discoverer D.T. Kemp, a short acoustic impulse or tone burst is 
reflected in the inner ear and emerges from the ear canal to the outside air 
with a delay that exceeds the roundtrip delay expected for purely mechanical 
processes including the traveling wave transmission on the BM. Even more 
surprising, the energy of the echo exceeds the energy expected for a purely 
passive reflection in the inner ear. Rather the relatively long delay and the 
gain in energy point to an active amplification mechanism at or near the 
point of mechanical-to-neural transduction. 

Besides the Kemp echo, even spontaneous oto-acoustic emissions, i.e.
sounds coming out of the ear without any acoustic “provocation,” have been
observed [9.23]. These probably reflect feedback instabilities of the amplifiers 
responsible for the Kemp echo. The nonlinearities of the inner ear, as evi- 
denced in Tartini’s terzi suoni and other combination tones are now thought 
to result from an overloading of these amplifiers rather than purely mechani- 
cal nonlinearities (violations of Hooke’s law of elasticity), which were always 
unbelievable because, as mentioned before, these nonlinearities are observ- 
able near the threshold of hearing when mechanical motion is comparable 
to the diameter of the hydrogen atom! Further evidence that the nonlinear 
distortions stem from overloading amplifiers is furnished by the fact that 
these low-level nonlinearities disappear when inner-ear metabolism ceases as 
a result of cutting the oxygen supply or administering ototoxic drugs - or 
death. 

On the basis of these observations, the amplifiers involved in sharpening 
the BM response are believed to be of a biochemical nature. The hypothesized 
chain of events is as follows:

- mechanical motion of the BM is sensed by the inner hair cells which stim- 
ulate the release of biochemical energy sources, i.e. energy-laden molecules 
like those found in muscle tissue. 

- the energy is converted into mechanical strains which are transmitted by 
the outer hair cells back to BM to amplify its motion. 
Of course, such a feedback arrangement can lead to instabilities: a screeching
or otherwise unpleasant sound that is perceived by the sufferer and
occasionally can even be heard by bystanders (external tinnitus).



9.9 Mechanical to Neural Transduction 

If an acoustic sinewave impinges on the outer ear, the BM is set into a 
motion which is (approximately) a sinusoidal function of time. This motion 
is transmitted to the outer hair cells, “riding on top” of the BM, via “cilia,” 
the little tufts of hair growing on each hair cell. These hairs are believed 
to be attached to the tectorial membrane. The relative motion between the 
BM and the tectorial membrane will bend (or otherwise deform or displace) 
these little hairs which thus play a role not unlike that of a phonograph 
pickup needle in converting mechanical vibrations into electrical signals. The 
motion of the cilia alters the electrical conductance of a biological membrane 
at the surface of the hair cell thereby modulating an electrical current flowing 
through the hair cell [9.24]. The sources of this current are well-documented 
potential differences between the different channels (scalae) of the inner ear.

The alternating current in turn is believed to influence the release of 
synaptic “vesicles” in the hair cell. These vesicles, little packets containing a 
chemical “transmitter” substance, migrate to the inner surface of the hair cell 
opposite the terminal of an “afferent” nerve fiber which they cause to “fire”.
These nerve firings can be recorded by means of a microelectrode whose tiny
tip (less than a thousandth of a millimeter in diameter!) has been inserted into
the inside of the nerve fiber. 

Electrical pulses (“spikes”) in the acoustic nerve are about 0.5 ms in duration
and can occur at an average rate exceeding 100 pulses per second even
without acoustic stimulation, i.e. “spontaneously”. A typical figure for the 
spontaneous firing rate is about 50 pulses per second. During strong steady- 
state acoustic stimulation, the firing rate can go to 150 or more pulses per 
second. 

During the onset of strong stimulation, the firing rate can, for a short 
time, go up as high as 1000 pulses per second. Much higher rates are not 
possible because of the refractory period (“dead time”) of the nerve after each
firing. This dead time, during which the nerve fiber restores its membrane
characteristics, is about 1 ms in the acoustic nerve.

For continued stimulation, the high-onset firing rate decays quickly to the 
so-called adapted firing rate. This “adaptation” is a property the acoustic 
nerves share with other neural structures. 

Adaptation appears to be one of the pervading facts of neural life which 
any serious model must take into account. In a typical model, adaptation 
is brought about by a “depletion” mechanism, i.e. a mechanism by which 
quanta (vesicles) of a chemical agent in the hair cell are used up in firing the 
attached nerve [9.25]. 
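
To make the depletion idea concrete, here is a deliberately crude toy model, not the model of [9.25]: a store of transmitter is used up in proportion to the firing drive and refilled with a fixed recovery time, and the firing rate is taken proportional to the remaining store. All parameter names and values are hypothetical and were chosen only so that the onset rate is near 1000 pulses per second and the adapted rate near 150.

```python
def adapted_firing_rates(drive=1.0, gain=1000.0, use_rate=100.0,
                         recovery_time=0.05, dt=1e-4, duration=0.5):
    """Firing rate (pulses/s) over time for a constant stimulus switched on at t = 0.
    store: fraction of transmitter quanta available (1.0 = full).
    rate = gain * drive * store; the store is depleted by firing and
    refilled with time constant recovery_time (simple Euler integration)."""
    store = 1.0
    rates = []
    for _ in range(int(round(duration / dt))):
        rates.append(gain * drive * store)
        store += dt * ((1.0 - store) / recovery_time - use_rate * drive * store)
    return rates

rates = adapted_firing_rates()
onset_rate, adapted_rate = rates[0], rates[-1]
print(onset_rate, adapted_rate)   # high onset rate decaying to a lower adapted rate
```

With these assumed numbers the store settles at one sixth of its full value, so the rate decays from 1000 to about 167 pulses per second within a few tens of milliseconds.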
Fig. 9.7. Firing rates of nerve spikes as a function of time for different waveforms.
The individual nerve acts as a nearly linear half-wave rectifier 



Besides adaptation and refractoriness, outstanding features of acous- 
tic nerve activity are a kind of (one-sided) linearity and an automatic gain 
control. Linearity means that the firing probability of the acoustic nerve, as 
determined during many cycles of a periodic stimulus, is a close (almost lin- 
ear) replica of the “positive” portions of the stimulating waveform, see Fig. 
9.7. During negative portions of the stimulus waveforms, the firing probability 
is suppressed below the spontaneous rate. Thus the transduction process, at 
least for a finite range of amplitudes, acts as an approximate linear half-wave 
rectifier. 

The automatic gain control inherent in the transduction process means 
that, within a certain range of amplitudes, the firing probability is almost independent
of the signal amplitude. It is particularly noteworthy that the gain
control mechanism leaves the waveform largely intact. Thus the gain control
seen in the acoustic neuron does not act like an instantaneous compressor - 
let alone like an amplitude clipper. Its action resembles the adjustment of a 
volume control with a response time of some 20 ms. 

One of the striking features of the auditory cortex is its tonotopic organization
for both “carrier” frequencies and modulation frequencies. It has
been known for some time that two pure tones of similar frequency stimulate
adjacent neurons in the cortex. More recently it was discovered that the all-important
modulation frequencies, too, are represented tonotopically in the
brain: similar modulation frequencies stimulate adjacent columns of neural
tissue. This type of neural organization is reminiscent of the discoveries of
Hubel and Wiesel for the visual system.
9.10 Some Astounding Monaural Phase Effects

In order to explain this remarkable insensitivity of the ear to phase and thus 
to the signal waveform, Ohm and Helmholtz proposed the following model:

- the ear has a set of tuned “bandpass filters” covering the audio frequency 
range, and 

- the ear measures the amplitude at the output of each filter and transmits 
only this information to the brain. 

The bandpass filters were thought to be realized by “tuned strings” in the 
BM. The hair cells were the obvious candidates for the amplitude measuring 
device. 

Controversy raged throughout the latter part of the 19th century and
the first half of the 20th century as to the validity of Ohm’s Acoustic Law. 
Most published counterexamples were eventually traced to faulty equipment 
which generated distortion products whose interference (with the signal and 
with each other) did depend on relative phase angles. (Note that much of 
the early work on phase perception was done before electronic filters and 
amplifiers had become available!) 

Nevertheless, a hard core of genuine phase effects remained - the elegant 
“AM-FM” experiment by Mathes and Miller being perhaps the best known. 
The stimulus in the AM-FM experiment consists of an AM carrier at, say, 
2000 Hz, the modulation frequency being, for example, 100 Hz. The signal 
thus consists of three frequency components at 1900, 2000, and 2100 Hz. 

If the phase of one of the sidebands at 1900 or 2100 Hz is changed by 180°, 
the AM signal changes into a “quasi-FM” (QFM) signal (“quasi,” because 
the frequency modulation is accompanied by a small amplitude modulation 
at twice the modulation frequency). 
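
The difference between the two stimuli shows up clearly in their envelopes. The sketch below writes the three components as a complex (analytic) sum, so that the envelope is simply its magnitude; the modulation index `m = 0.5` is an assumed value, since the text does not specify one, and the function name is illustrative.

```python
import cmath
import math

def envelope(t, m=0.5, fm=100.0, flip_lower_sideband=False):
    """Envelope of a carrier plus two sidebands of relative amplitude m/2.
    The common carrier factor exp(i*2*pi*fc*t) has unit magnitude and drops out."""
    upper = 0.5 * m * cmath.exp(1j * 2 * math.pi * fm * t)
    lower = 0.5 * m * cmath.exp(-1j * 2 * math.pi * fm * t)
    if flip_lower_sideband:
        lower = -lower          # 180-degree phase change of one sideband
    return abs(1.0 + upper + lower)

period = 1.0 / 100.0            # one period of the 100-Hz modulation
times = [period * i / 1000 for i in range(1000)]
am = [envelope(t) for t in times]
qfm = [envelope(t, flip_lower_sideband=True) for t in times]
print(min(am), max(am))         # envelope swings between 1 - m and 1 + m
print(min(qfm), max(qfm))       # envelope stays between 1 and sqrt(1 + m**2)
```

Under these assumptions the QFM envelope equals the square root of 1 + m² sin²(2π·fm·t), which repeats every half modulation period; this is the small residual amplitude modulation at twice the modulation frequency mentioned above.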

On listening alternately to the AM and QFM signals, a pronounced difference
in acoustic quality is perceived. If Ohm or Helmholtz had been able to listen
to these signals, they might have been reluctant to formulate their phase
law. They would have had to discard their simple ideas - to the detriment 
of their followers whose work was based on the tuned-filter model of hearing. 
Primitive equipment, such as Helmholtz had to work with, sometimes helps 
to formulate powerful ideas - even if, in the end, these ideas emerge as mere 
approximations. (Would Newton have dared to formulate his laws of motion 
and gravitation if Kepler’s observations had been beset by relativistic side 
effects such as the perihelion motion of Mercury?) 

The outcome of the AM-FM experiment revived interest in the ear’s ca- 
pability to decode waveforms (as opposed to amplitude spectra). As early 
as 1954, upon joining Bell Laboratories, I became intrigued by the possi- 
ble effects of waveform on the quality of synthetic speech. To what extent 
was the poor quality of vocoder speech [9.26] caused by waveform effects? 
Vocoders, in conformity with Helmholtz’ thinking, reproduced the short-time 
Fig. 9.8. Top: one period of a periodic pulse train consisting of 31 harmonics (1
through 31) in cosine phase (0°). Bottom: the same 31 harmonics but with “random”
phase angles (0° or 180°). The range from maximum-to-minimum signal amplitude
(the “peak factor”) is reduced by a factor 2.64 compared to the signal shown at the
top. In this illustration, the phase angles were computed from a theoretical formula
based on frequency-modulated signals [9.27]



amplitude spectrum of speech signals with no regard to the phase angles of 
the analyzed speech signal. Specifically, voiced speech sounds are synthesized 
from near-periodic pulse trains whose harmonics all have equal (zero) phase
angles. If one randomized these phase angles, a signal with a much smaller
“peak factor” (defined as maximum-to-minimum amplitude range divided by
the rms amplitude) would result. Figure 9.8 shows the result of such a phase
angle “randomization”.

In listening to the two waveforms depicted in Fig. 9.8 an astonishingly 
large quality difference is perceived - in spite of the identity of their amplitude 
spectra (31 harmonics of equal amplitude from 100 Hz to 3100 Hz). Synthetic
speech signals obtained from these two “excitation” signals likewise sounded
quite different (less “buzzy” for the low-peak-factor waveform).
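
The effect of the phase angles on the peak factor is easy to reproduce numerically. The sketch below compares 31 equal-amplitude harmonics in cosine phase with the same harmonics given quadratically increasing phases. The quadratic rule is an assumption made here for illustration; it is not claimed to be the formula of [9.27], nor to reproduce the binary (0° or 180°) phases or the factor 2.64 of Fig. 9.8.

```python
import math

def peak_factor(phases, samples_per_period=2048):
    """(maximum - minimum) / rms over one period of a sum of
    equal-amplitude harmonics 1..N with the given phase angles."""
    values = []
    for i in range(samples_per_period):
        x = 2 * math.pi * i / samples_per_period
        values.append(sum(math.cos(k * x + phi)
                          for k, phi in enumerate(phases, start=1)))
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return (max(values) - min(values)) / rms

N = 31
cosine_phases = [0.0] * N
# Illustrative quadratic phase rule (an assumption, see the lead-in).
quadratic_phases = [-math.pi * k * (k - 1) / N for k in range(1, N + 1)]

pf_cosine = peak_factor(cosine_phases)
pf_quadratic = peak_factor(quadratic_phases)
print(pf_cosine, pf_quadratic)   # the second value is markedly smaller
```

Both signals have exactly the same amplitude spectrum, so any difference in the printed peak factors is due to the phase angles alone.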

The first explanation attributed this quality difference to the large differ- 
ence in peak factor. But subsequent experiments with waveforms of similarly 
low peak-factors showed that every one of them sounded different - except 
when their phase angles were related by a “linear” transformation [9.28], i.e. 
if the two sets of phase angles $\varphi_k$ and $\psi_k$ were related to each other as

$$ \psi_k = \varphi_k + \alpha + \beta k , \tag{9.1} $$

where $\alpha$ and $\beta$ are arbitrary constants and $k$ is the harmonic number.
The term $\beta k$ in (9.1) is trivial since it causes a simple delay of $\beta/(2\pi f_1)$,
where $f_1$ is the fundamental frequency.

The $\alpha$ term in (9.1), however, is significant because it does lead to a
change in the waveform - yet, as already stated, it leaves the acoustic quality 
unchanged. 

Which aspect of the signal $s(t)$ remains invariant when the phase of each
harmonic is changed by the same amount? The answer: the signal envelope
$e(t)$ defined by (see also Chap. 10)

$$ e^2(t) = s^2(t) + \hat{s}^2(t) , \tag{9.2} $$

where $\hat{s}(t)$ is the Hilbert transform of $s(t)$:

$$ \hat{s}(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{s(\tau)}{t - \tau}\, d\tau . \tag{9.3} $$

With the aid of the Hilbert transform, an analytic signal $\sigma(t)$ can be defined,

$$ \sigma(t) = s(t) + i\,\hat{s}(t) , \tag{9.4} $$



whose Fourier transform vanishes for all negative frequencies. 

An alternate way of writing the analytic signal uses the previously defined
envelope $e(t)$ and a phase function $\varphi(t)$:

$$ \sigma(t) = e(t) \exp[i\varphi(t)] , \tag{9.5} $$

where

$$ \varphi(t) = \arctan\left[\frac{\hat{s}(t)}{s(t)}\right] . \tag{9.6} $$



Since $\sigma(t)$ has no negative frequencies, a phase shift by $\alpha$ is equivalent to a
multiplication of $\sigma(t)$ by $\exp(i\alpha)$:

$$ \sigma_\alpha(t) = e(t) \exp[i\varphi(t) + i\alpha] . \tag{9.7} $$



The corresponding phase-shifted real signal is

$$ s_\alpha(t) = e(t) \cos[\varphi(t) + \alpha] , \tag{9.8} $$

whose envelope $e(t)$ is thus seen to be independent of a phase shift.

In other words, phase transformations of the form $\psi_k = \varphi_k + \alpha + \beta k$,
which do not change the acoustic quality of a signal, also leave the envelope
invariant (except for a delay undetectable in monaural listening).
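
This invariance can be verified numerically for a harmonic complex, for which the analytic signal is just the sum of the complex exponentials of the harmonics, so no explicit Hilbert transform is needed. The following is a sketch under that assumption (equal-amplitude harmonics, standard sign convention); the phase values, the constants `alpha` and `beta` and all names are arbitrary illustrative choices.

```python
import cmath
import math

def analytic_signal(t, phases, f1=100.0):
    """Sum over harmonics k = 1..N of exp(i(2*pi*k*f1*t + phase_k))."""
    return sum(cmath.exp(1j * (2 * math.pi * k * f1 * t + phi))
               for k, phi in enumerate(phases, start=1))

def envelope(t, phases, f1=100.0):
    """Envelope e(t) = |analytic signal|."""
    return abs(analytic_signal(t, phases, f1))

f1 = 100.0
phases = [0.3 * k * k for k in range(1, 6)]          # arbitrary starting phases
alpha, beta = 0.7, 0.4                               # arbitrary constants
shifted = [phi + alpha + beta * k for k, phi in enumerate(phases, start=1)]
delay = beta / (2 * math.pi * f1)                    # time shift caused by the beta*k term

times = [i * 1e-4 for i in range(100)]               # one 10-ms period
env_error = max(abs(envelope(t, shifted) - envelope(t + delay, phases))
                for t in times)
wave_change = max(abs(analytic_signal(t, shifted).real
                      - analytic_signal(t + delay, phases).real)
                  for t in times)
print(env_error)     # essentially zero: the envelope is only shifted in time
print(wave_change)   # clearly nonzero: the waveform itself has changed
```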

What would thus be more natural than to assume that the ear acts as an
envelope detector on the BM waveforms it “sees”? This is, in fact, the view
originally suggested and widely accepted among acousticians. 

Nevertheless, a strict envelope hearing hypothesis and the linear-phase 
transformation rule are contradicted by several listening experiments with 
signals containing only two frequency components. For such signals, any 
phase transformation is a linear-phase transformation of the form (9.1) and
the envelope is given solely by the amplitude spectrum. Thus phase manip- 
ulations on such signals should be undetectable. However, Craig and Jef- 
fress [9.29] have shown that this is not so, although the phase effects they 
found were rather subtle. 



9.11 Masking 

Masking is one of the most pervasive facts of human hearing. Masking means 
that one sound makes another, weaker sound, inaudible. A pedestrian wear- 
ing a Walkman playing loud music may not hear the oncoming truck. Rever- 
beration makes the following weaker speech sounds inaudible and therefore 
unintelligible. 

But even if a sound is not rendered completely inaudible, its perceived 
loudness may be reduced by a stronger sound. This is called loudness re- 
duction or partial masking. The masking effect is typically limited to the 
frequencies near that of the masker which could be a tone, a tone complex 
(chord), or a noise - narrow or wide. 

Except at very low sound levels, the frequency dependence of masking 
is asymmetric: there is more upward than downward spread of masking. In 
other words, a noise around 1000 Hz masks a tone at 1200 Hz more effectively 
than a tone at 800 Hz. For masking within a critical band, the tone can have 
a sound level as low as 3 to 6 dB below the noise level before it is completely 
masked. 

For the quantization of speech signals the masking - not of a tone by 
noise, but of noise (quantizing noise) by tones (speech) - is the interesting 
question. Psychoacoustic experiments by J. L. Hall, Jr., and the author have 
revealed that the presence of a weak noise, as weak as 24 dB below a pure 
tone, can still be detected. How is this possible? Listening to such stimuli 
shows that, at such low levels, the presence of a noise is not perceived as a 
noise per se but as a slight “quaver” in the tone. Hence masking of noise by 
tones is quite different from masking of tones by noise. 



9.12 Loudness 

The loudness of a pure tone depends not only on its physical intensity or level 
but also on its frequency. The loudness level of a test tone (or any acoustic 
signal) is determined by adjusting the level of a 1-kHz tone until it sounds 
equally loud as the test stimulus. The level in decibel above threshold of the 
1-kHz tone is then called the loudness level of the test stimulus. The unit of 
loudness level is the phon. 

Loudness is the perceptual attribute of the physical intensity of a sound.
The unit of loudness, the sone, is defined as the loudness of a binaural 1-kHz
tone at 40 dB sound pressure level (SPL) above the threshold of hearing. A
sound that is perceived as twice as loud is given the loudness value $L = 2$.
In this manner, by subjective loudness judgments, an entire loudness scale
can be established ranging from zero sones at the threshold of hearing to 256
sones at a sound intensity $I$ = 120 dB SPL. Over a large range of hearing
(above about 20 dB SPL) a simple power law exists between loudness $L$ and
intensity $I$:

$$ L \propto I^{0.3} . \tag{9.9} $$

This law means that merely to double the loudness of a rock group of five 
musicians, say, we have to increase their number tenfold, to 50 players of equal
power output. (This minor calculation explains the resounding enamoration 
of popular music makers with electronic amplifiers.) 

By the same token, if we want to halve the loudness of a continuous 
“rumble” emanating from a busy highway, we have to reduce the acoustic 
noise output by a factor of ten! This may sound difficult, but it is not, at 
least not from a purely physical point of view: tire noise - the main culprit at 
steady highway speeds - decreases drastically with decreasing vehicle speed. 
In fact, the noise intensity is approximately proportional to the fourth power 
of speed. 
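
These rules of thumb are easily turned into numbers. The sketch below assumes the "doubling of loudness per tenfold increase in intensity" rule with 1 sone at 40 dB, which holds only in the power-law range (it does not reproduce the drop to zero sones at threshold), and the fourth-power dependence of tire-noise intensity on speed quoted above. The function names are illustrative, and the exponent is taken as log10(2) rather than the rounded 0.3.

```python
import math

def sones_from_level(level_db):
    """Loudness in sones: 1 sone at 40 dB, doubling for every 10 dB."""
    return 2.0 ** ((level_db - 40.0) / 10.0)

def intensity_ratio_for_loudness_ratio(loudness_ratio, exponent=math.log10(2.0)):
    """Intensity ratio needed for a given loudness ratio, from L ~ I**exponent."""
    return loudness_ratio ** (1.0 / exponent)

def speed_ratio_for_intensity_ratio(intensity_ratio, speed_exponent=4.0):
    """Speed ratio giving the desired noise-intensity ratio, for I ~ speed**4."""
    return intensity_ratio ** (1.0 / speed_exponent)

print(sones_from_level(120.0))                       # 256.0 sones
print(intensity_ratio_for_loudness_ratio(2.0))       # about 10: ten times the players
quieter = intensity_ratio_for_loudness_ratio(0.5)    # about 0.1 of the noise output
print(speed_ratio_for_intensity_ratio(quieter))      # about 0.56 of the original speed
```

Under these assumptions, halving the loudness of tire noise would require slowing traffic to a little over half its original speed.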

On the other hand, a tenfold increase in the average intensity of traffic 
noise caused by a tenfold increase in traffic density can raise the rate of 
complaints by irate residents perhaps a hundred fold: one loud truck every 
5 minutes may be tolerable, but one every 30 seconds could be a nightmare 
and would certainly make outdoor conversation nearly impossible. And what 
is true for trucks is just as true for low-flying aircraft. 



9.13 Scaling in Psychology 

Whereas measurement in classical physics is a well-understood process, relat- 
ing an observed quantity to a well-defined unit, the situation in psychology 
was not so clear-cut until the physiologist E. H. Weber (1795-1878) - brother 
of the physicist Wilhelm Weber (1804-1891) - made careful studies of the 
sensations of sound and touch, thereby laying the foundations of a new sci- 
ence, the science of sensations. According to Weber’s law, an increase in 
stimulus necessary to elicit a just noticeable increase in sensation is not a 
fixed quantity, but depends on the ratio of increase to the original stimu- 
lus. Later, the physicist and philosopher G. T. Fechner (1801-1887) restated 
Weber’s law (now called the Weber-Fechner law) and specified its domain 
of validity.^ Modern psychologists, and particularly S. S. Stevens, have succeeded
in introducing measurement methods into psychology that are nearly
as unambiguous as objective measurements in physics [9.30]. The new discipline
has therefore rightly earned the designation psychophysics, of which
psychoacoustics is a special branch, as is psychovisual research.

^Fechner also fathomed experimental aesthetics by measuring which shapes and
dimensions are most pleasing. He may have been the first to conduct a public
opinion poll (to discover which of two Holbein paintings was preferred by viewers).

One of Stevens’s great contributions was the introduction of ratio scales for
subjective variables (like loudness and brightness) and the discovery of sim- 
ple power-law relations between these subjective variables and corresponding 
physical quantities (like energy flux or intensity) [9.31]. 

As already mentioned, for a sound to double in loudness L, its intensity
I has to be multiplied by a factor of 10. Thus, because $\log_{10} 2 \approx 0.3$, we
have the following power law for loudness as a function of acoustic intensity:

$L \propto I^{0.3}$ .

Someone who has not participated in a psychoacoustic scaling test might 
object that “loudness doubling” is not a well-defined concept. But surprisingly,
the random scatter encountered in such tests is remarkably small even
between different listeners. 

The exponents found in psychophysical power laws, such as the value 
0.3 in the above equation, are not universal but are specific to the sense 
modality studied (subjective brightness, perceived weight, or apparent length, 
for example) and have been analyzed in great detail by psychophysicists. One 
important research question concerns the transitivity of these exponents when 
comparing loudness with weight and weight with brightness, for example, and 
what it might reveal about brain functions. 

If we replace the sound intensity I in (9.9) by the sound pressure p, then,
because intensity is proportional to pressure squared, we have

$L \propto p^{0.6}$ .

Interestingly, the exponent 0.6 can be derived from an exponent of 0.5 found 
at a more fundamental level, the Fourier-like “critical” frequency-band de- 
composition of sounds in the inner ear. The exponent 0.5, in turn, turns 
our attention in the direction of statistical analysis and uncertainty result- 
ing from the firing rate of nerve pulses traveling along the acoustic nerve 
up to higher auditory centers in the brain. If these pulses were a modulated 
Poisson process whose mean rate was proportional to the sound pressure p, 
then the uncertainty of the number of pulses in a given time interval (100 ms, 
say) would be proportional to $p^{0.5}$. Since many ratio scales in psychophysics
are found to be directly related to perceptual uncertainties (“just noticeable 
differences”), the observed power law for subjective loudness versus physical 
intensity would then indeed be predicted by such a statistical model of neural 
firing rates. 
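The Poisson step of this argument can be written out explicitly. The following is only a sketch under the paragraph's own assumption that the pulse count N in an interval is Poisson distributed with a mean proportional to the sound pressure p; the symbols T (counting interval) and k (proportionality constant) are introduced here purely for illustration.

```latex
\langle N \rangle = k\,p\,T ,
\qquad
\sigma_N = \sqrt{\langle N \rangle} = \sqrt{k\,T}\; p^{0.5} \;\propto\; p^{0.5}
```

The square root appears because the variance of a Poisson count equals its mean.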

In reality, loudness perception is more complicated, but the observed 
power laws and their exponents have yielded important clues and steered 
researchers in the right direction. 

9.14 Pitch Perception and Uncertainty

The just-noticeable frequency difference between two tones around 1000 Hz
is less than 1% for normal listeners. Trained listeners can achieve a frequency
discrimination of at least 3 Hz (0.3%) at 1000 Hz. According to the Heisenberg
uncertainty principle, this high frequency resolution implies a time window
longer than about 0.3 s. On the other hand, in listening to speech humans
are known to discriminate temporal detail of 0.01 s or less. In other words,
the time-bandwidth uncertainty product of the human ear appears to equal
0.03 instead of 1, making human hearing 30 times better than the laws of
physics allow. How is this possible?
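The arithmetic behind this apparent paradox is short enough to spell out. The sketch below assumes the uncertainty relation in the simple form "frequency resolution times time resolution is at least about 1"; the variable names are illustrative.

```python
# Sketch of the time-bandwidth arithmetic (values from the text; names are illustrative).
freq_resolution_hz = 3.0     # frequency discrimination of trained listeners at 1000 Hz
time_resolution_s = 0.01     # temporal detail resolved when listening to speech

# Time window implied by the frequency resolution, assuming delta_f * delta_t >= 1
min_window_s = 1.0 / freq_resolution_hz

# Product obtained if both resolutions were achieved by one and the same analyzer
apparent_product = freq_resolution_hz * time_resolution_s

# How far the apparent product falls below the limit of 1
shortfall = 1.0 / apparent_product

print(round(min_window_s, 2), round(apparent_product, 3), round(shortfall))
```

The shortfall comes out at about 33, which the text rounds to 30.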

The only answer seems to be that, depending on the task, the human 
auditory system can use one of several spectral analyzers with different time 
resolutions and frequency discriminations. Thus, in listening to speech, the 
hearer uses an analyzer with a high time resolution (0.01 s). When listening
to (slow) music, he or she switches to an analyzer with high frequency
discrimination. 




10. Binaural Hearing — 
Listening with Both Ears^ 



Hear the other side. 



St. Augustine (354-430)



Although, at present, telephone speech is mostly monaural, the future prom- 
ises much binaural business. Think of tele-conferences and virtual acoustic 
spaces. The following informal overview of binaural hearing covers directional 
hearing (in the horizontal and vertical planes), the precedence and Haas 
effects (and their applications in public-address and “assisted-resonance” 
systems), artificial reverberation, pseudo-stereophony, binaural release from 
masking, the cocktail-party effect, central-pitch phenomena, Deutsch’s octave 
illusion, the creation of virtual sound images, and the faithful reproduction 
of concert hall recordings in an anechoic environment for acoustical quality 
studies. The chapter concludes with a brief review of sound-diffusing surfaces 
based on number theory that have become standard equipment in sound 
recording studios, including those for speech signals. 



10.1 Directional Hearing 

Our directional sense of hearing is astounding. Differences of arrival times 
at the two ears of a few tens of microseconds can be clearly perceived as 
a change in direction. Thus the time resolution, $\Delta t$, of our ears in binaural
hearing exceeds by a factor of 10 or more that of monaural hearing which lies
in the millisecond range.

^Adapted in part from Music Perception 10, 225-280 (1993).

Fig. 10.1. A Swiss naval pilot, on board a lake steamer, wearing a listening device
with an extended binaural base to better locate a fog horn

This exquisite time resolution translates to an angular discrimination $\Delta\alpha$
of just a few degrees in the horizontal plane - and the horizontal plane is,
of course, the place where the sound sources of greatest interest to us cavort
and where much of the danger lurks. The approximate relationship is

$\Delta\alpha \approx c\,\Delta t / d$ , (10.1)

where c is the velocity of sound in air and d is the “baseline,” i.e. the effective 
distance between the two ears [10.1]. 
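To get a feel for the numbers, the sketch below evaluates the small-angle relation $\Delta\alpha \approx c\,\Delta t / d$. The speed of sound (343 m/s), the time resolution of 20 microseconds, and the baselines of 0.2 m and 1 m are assumed illustrative values, not figures given in the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed value)


def angular_discrimination_deg(delta_t_s: float, baseline_m: float) -> float:
    """Angular discrimination in degrees from delta_alpha ≈ c * delta_t / d (small angles)."""
    return math.degrees(SPEED_OF_SOUND * delta_t_s / baseline_m)


# Assumed values: 20 microseconds time resolution, 0.2 m effective ear distance
human = angular_discrimination_deg(20e-6, 0.2)
# The same time resolution with a hypothetical 1 m listening device
extended = angular_discrimination_deg(20e-6, 1.0)
print(round(human, 2), round(extended, 2))
```

With these assumptions the unaided listener resolves about two degrees, and lengthening the baseline fivefold sharpens the resolution by the same factor.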

Of course, with a longer baseline between our ears, our directional hearing 
would be even more acute, as furniture designers and the military have dis- 
covered early on. Figure 10.1 shows a Swiss ship’s pilot and Fig. 10.2 depicts 
an armchair with “ear-trumpets,” an early binaural preprocessor. 

How do we localize in the vertical plane? Admittedly, this is not as im- 
portant to humans as localization in the horizontal plane. But, as already 
mentioned, people do have considerable vertical directional discrimination. 
This is a bit puzzling because there are no arrival time or intensity differ- 
ences between the ears for sounds arriving in the symmetry plane through 
the human head. Could slight head motions be responsible? This possibility 
has been eliminated by having subjects bite into bite-boards to immobilize 
their heads. Finally, Jens Blauert came along and showed that the frequency 
content of a signal was responsible for elevation judgments [10.2]. 

Differences in spectral content are mediated by sound diffraction at the lis- 
tener’s head, causing different filtering actions for sound waves arriving from 
different directions in the symmetry plane. Thus, for example, certain high 
frequencies coming from behind the listener are attenuated by the shadowing 
effect of his outer ears (pinnae). Distance judgments, too, are facilitated by 
spectral content [10.3]. 
Fig. 10.2. Early model of an integrated binaural hearing aid. (Patent pending?)



10.2 Precedence and Haas Effects 

One of the more astonishing attributes of binaural hearing is the precedence 
effect, or “law of the first wavefront.” This effect was first described over 
100 years ago by the Princeton physicist Joseph Henry (1797-1878), whose 
name is enshrined in the unit for magnetic inductance. Henry observed that 
when two (or more) similar transient sounds reach a listener from different 
directions in rapid succession, the listener hears a single sound from the 
direction of the first-arriving sound. 

It has been hypothesized that the precedence effect has evolved because, 
in a reverberant environment (think of a dense forest or an echoic cave), it 
is the direction of the first wavefront that betrays the direction of a predator 
on the prowl or a potential mate - or a source of live food. 

In 1950, Helmut Haas, in his Göttingen thesis, showed that the later
sounds can exceed the first sound in intensity by some 10 dB without causing 
the perceived direction to change much [10.4]. 

This so-called Haas effect (so named by R. H. Bolt) allows the enhance- 
ment of sound levels in a public-address system by means of loudspeakers 
radiating an amplified but delayed signal without becoming audible and 
distracting the listener’s attention from the original, albeit weaker, sound 
source. A “smart” public-address system exploiting the Haas effect was first 
installed in 1953 by Parkin in St. Paul’s Cathedral in London. The “assisted- 
resonance” system in Royal Festival Hall (to improve its low-frequency re- 
sponse) is a further exploitation of the first-wavefront principle. And the 
Palace of Congresses in the Moscow Kremlin is an example of a hall where 
artificial reverberation completely dominates the natural sound. (The reverberation
is “manufactured” in the subterranean chambers of the Palace and
piped into the main hall by a plethora of loudspeakers.)

The precedence effect can be nicely demonstrated in an amusing experi- 
ment conceived by the late Nico Franssen of the Philips research laboratories 
in Eindhoven, the Netherlands. A tone signal is fed to two loudspeakers, 
the left speaker radiating only a short transient and then falling silent, see 
Fig. 10.3. The right speaker radiates a slightly delayed and softly turned-on 
steady tone. Because of the precedence effect, the sound is located at the left 
speaker. 

In a reverberant environment, listeners invariably perceive the tone as 
continuing to come from the silent speaker and are amazed when the demon- 
strator “pulls the plug” on the left loudspeaker and the thus disconnected 
loudspeaker still seems to be the active source. Interestingly, this paradoxical 
percept only works in a reverberant environment with lateral reflections in 
which the binaural cues for a steady tone (amplitude and phase differences) 
are ambiguous as a result of multipath transmission. In this kind of situation
the ear appears to cling to the only unambiguous clue, the initial transient. 
This is another instance of the “continuity effect” which is also manifest in 
other sensory modalities, including monaural perception, see Sect. 9.4.2. 

Equally surprising is the famous “cocktail-party effect” which allows a 
binaural listener to suppress unwanted sounds when their directions do not 
coincide with that of the wanted signal. I think it is fair to say that without 
this remarkable ability of human hearing the cocktail-party as a social insti- 
tution would have become extinct long ago (assuming, of course, that people 
actually want to listen to what other people have to say, see Fig. 10.4). 




Fig. 10.3. Binaural localization completely gone astray. In a reverberant room, the 
tone is heard as emanating from the left loudspeaker even after it has long been 
turned off there and, in fact, is radiated only from the right loudspeaker. This is 
another instance of the “continuity effect” that pervades all sensory modalities. - By 
contrast, in an anechoic environment, on an open lawn, for example, the auditory 
percept switches immediately from left to right after the left loudspeaker ceases to 
emit sound because there is no binaural amplitude and phase ambiguity as in a 
reverberant space 
“Although humans make sounds with their mouths and occasionally
look at each other, there is no solid evidence that they
actually communicate among themselves.”

Fig. 10.4. A duo of dolphins belittling human communication - caught in the act 
by Sidney Harris 



The electronic simulation of the cocktail-party effect and other schemes 
for suppressing unwanted noises are among the more ambitious aims of future 
binaural hearing aids and conference telephone systems. 



10.3 Vertical Localization 

How is vertical localization possible? One older theory attributed this ability 
to small (involuntary) head motions. But the ability to localize in the median 
plane persists even when the listener’s head is rigidly fixed (e.g. by a “bite- 
board”). 

The answer is that for different directions of incidence, the sound is differ- 
ently diffracted at the head and the pinnae. Thus the sound waves entering 
the external ear canal are, in fact, “filtered” by diffraction. For each direction, 
characteristic peaks and valleys are superimposed on the spectrum of the in- 
cident sound. If the sound has a sufficiently broad spectrum (as is true for 
speech, music, and most noises), these peaks and valleys in the spectrum of 
the acoustic signal that strikes the eardrums can be recognized and utilized to 
determine the direction of incidence [10.5]. People have apparently learned, 
over many listening experiences, to associate different spectral characteristics 
with different vertical directions! 

If this hypothesis is correct, then a broadband noise, whose spectrum has 
been shaped in accordance with the diffraction process for a given direction 
of incidence, should elicit the corresponding subjective direction irrespective 
of the actual direction. Subjective tests employing such filtered signals have 
confirmed this hypothesis to an astounding degree [10.2], see Fig. 10.5. 

Fig. 10.5. Sound localization in the vertical plane - long a mystery because of the
absence of binaural difference cues - is apparently determined by spectral shape and
frequency content. Strong frequency components around 300 Hz and 3 kHz elicit a
frontal perception, frequency bands around 1 kHz and 10 kHz favor the perception
of sound arriving from the back; emphasis of spectral components around 8 kHz
elicits a “from above” percept

In fact, another puzzle that has defied solution for many decades has
finally been explained by considering spectral characteristics. The puzzle is
the following. Why, when listening over earphones, do we hear the sound as
originating inside our head? Why is it so difficult, or even impossible, to
“externalize” the sound source as we habitually do when listening to primary
sound sources or to loudspeakers?

One answer, that evoked some credence for a time, was that, with ear- 
phone listening, sound sources followed head motions and that such seemingly 
nonstationary sound sources evoked sound images inside the listener’s head. 
Another “theory” held the mechanical pressure of the headset’s ear cushion 
on the head responsible for “internalization”. Benjamin Bauer of CBS Lab- 
oratories, in an ingenious experiment, disproved both theories by “pressure- 
less” earphones whose signals were automatically modified when the listener’s 
head moved, to correspond to the modifications experienced when listening 
in a free sound field. Again, many listeners (including the author) failed to 
externalize the sound images. 

The failure to externalize with earphone listening is now attributed to the 
following set of circumstances. When wearing earphones, standing waves are 
set up in the external ear canal between the eardrum and the membrane of the 
earphone. These standing waves have a filtering action which is rather differ- 
ent from the spectral peaks and valleys caused by diffraction at the listener’s 
head in a free-field listening situation [10.6]. Thus the listener can associate
no external location with earphone listening and consequently associates the 
sound sources with the only remaining location: inside the listener’s head. 

If this theory is correct, then inverse filtering (to remove the effect of 
standing waves in the ear canal) combined with a filter response corresponding
to free-field listening should give externalized sound sources. This has, in
fact, been demonstrated by Laws [10.7] and others. 

The late R. L. Wallace, Jr., of Bell Laboratories has constructed earphones 
which minimize standing waves between the earphone membrane and the 
eardrum by placing absorbing materials at the entrance to the ear canal. 
With these earphones people are able to externalize sound sources although
not as consistently as with electrical filters, which not only eliminate the 
effect of the standing waves but substitute the proper filtering action of head 
diffraction. 



10.4 Virtual Sound Sources and Quasi-Stereophony

An important application of filtering sound signals to evoke proper local- 
ization was demonstrated as early as 1962 [10.8]. In ordinary two-channel 
stereophonic systems, perceived sound sources are usually restricted to the 
space between the two reproducing loudspeakers. Physical sound sources to 
the rear, overhead, and to the sides (outside the line connecting the two 
loudspeakers) are not properly reproduced in their apparent positions. Yet, 
because we have only two ears, two loudspeakers should suffice to evoke all 
the proper perceptions of acoustic space - provided the sound waves radiated 
from the two loudspeakers are “tailored” in such a way as to produce, at 
the listener’s eardrums, pressure waves indistinguishable from those that the 
ears would have received in a free sound field set up by the desired sources 
(including sources to the rear, overhead, and to the extreme sides). 

Implementation of this idea requires the prior measurement of the com- 
plex transmission functions (i.e. amplitude and phase as functions of fre- 
quency) between a loudspeaker in an anechoic chamber and the right and 
left eardrums of a human listener. 

Suppose two loudspeakers are placed in front of a listener, one, say, 30°
to the right and the other 30° to the left. Call the transmission function from
a loudspeaker to the eardrum on the same side $S(f)$ and that to the eardrum
on the other side $A(f)$.

If $A(f)$ were zero, i.e. if there were no “crosstalk” from the loudspeakers
to the “far” ears, the task of providing each eardrum with a specified sound
signal would be simple. The loudspeaker signals must be filtered by the inverse
of the “same-side” transmission function, $S^{-1}(f)$. However, regrettably
for our purposes, there is crosstalk to the “other-side” ear due to sound
diffraction around the human head. This crosstalk must be cancelled.

This can be accomplished by the other loudspeaker radiating an appro- 
priately filtered crosstalk compensation signal. Of course this crosstalk com- 
pensation signal also “talks across” to the ear for which it is not intended 
and must, therefore, be compensated by a compensation signal supplied to 
the first loudspeaker - and so on ad infinitum. 

Fig. 10.6. How can one transfer a pair of binaural signals, recorded, say, from a
dummy head (see Fig. 10.7), to the ears of a human listener by means of loudspeakers
(instead of earphones)? This schematic drawing shows how to compensate, by
means of electronic filters, the cross-talk from the left loudspeaker to right ear and
the right speaker to the left ear. This arrangement allows free-field listening and has
been used for concert-hall simulations. With earphone listening, by contrast, the
simulated (“virtual”) acoustic space would follow head motions instead of remaining
stationary. The depicted presentation also avoids the “in-head” localizations
that bedevil earphone listening

The all-inclusive solution to this multiple filtering and cancellation problem
is illustrated in Fig. 10.6, where $C(f) = -A(f)\,S^{-1}(f)$ is the transfer
function of the crosstalk compensation filter. The overall transmission function
from the right input ($R$) to the right ear ($r$) is then

$R_r(f) = (1 - C^2)^{-1} S^{-1} S + C\,(1 - C^2)^{-1} S^{-1} A$ ,

with $C = -A S^{-1}$, $R_r = 1$ as required. The overall response from the right
input to the left ear ($\ell$) can also easily be read off Fig. 10.6:

$R_\ell(f) = (1 - C^2)^{-1} S^{-1} A + C\,(1 - C^2)^{-1} S^{-1} S = 0$ ,

as required.
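The two results can be verified numerically at a single frequency. In the sketch below, the complex values chosen for S and A are made-up stand-ins for measured transmission functions, and the variable names are illustrative; only the filter structure (a common factor $(1 - C^2)^{-1} S^{-1}$ and the compensation filter $C = -A S^{-1}$) is taken from the text.

```python
# Numerical check of the crosstalk-cancellation scheme at one frequency.
# S and A are arbitrary made-up complex values, not measured transfer functions.
S = 0.9 * complex(0.8, 0.6)    # loudspeaker to same-side eardrum
A = 0.4 * complex(0.3, -0.5)   # loudspeaker to other-side eardrum ("crosstalk")

C = -A / S                              # crosstalk compensation filter, C = -A * S**-1
common = 1.0 / ((1.0 - C * C) * S)      # (1 - C**2)**-1 * S**-1

# Loudspeaker drive signals for a unit signal applied to the right input only
right_speaker = common
left_speaker = C * common

# Pressure at each eardrum: same-side path plus crosstalk path
right_ear = right_speaker * S + left_speaker * A   # expected: 1
left_ear = right_speaker * A + left_speaker * S    # expected: 0

print(abs(right_ear - 1.0) < 1e-12, abs(left_ear) < 1e-12)
```

In a real system S and A depend on frequency, so the same computation would have to be carried out across the whole audio band to design the filters.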

The practical experience with the filtering scheme illustrated in Fig. 10.6 
has been nothing less than amazing. Although the two loudspeakers are the 
only sound sources, virtual sound images can be created far off to the sides 
and even behind the listener. In fact, even the elevation angle of a sound 
source is properly perceived (best by people with proper head shapes!). Since 
the entire system is linear, many sound sources and their echoes can be reproduced
simultaneously, without mutual interference, provided the listener
is sitting in the proper position between the loudspeakers and does not turn 
his head away from the front direction by more than about ±10°. The spa- 
tial illusion is, in fact, so convincing that the listener is tempted to “look 
around” for any invisible sound sources. However, the moment he gives in to 
this temptation the realistic illusion disappears, frequently changing into an 
“inside-the-head” sensation. 

The sound reproduction method illustrated in Fig. 10.6 has opened up 
completely new possibilities in the study of concert hall acoustics [10.9]. Be- 
fore, in comparing two halls, one had to base one’s judgment on listening 
to pieces of music, played at different times, perhaps by different orchestras 
under different conductors. Even if all other factors were equal, the fact that 
two musical experiences are separated by days, weeks, or even months makes
any subtle quality assessments exceedingly unreliable if not impossible. 




Fig. 10.7. Dummy head, equipped with condenser microphones for eardrums, ac- 
companied by K.F. Siebrasse and D. Gottlob who earned their Ph.D. degrees on 
concert hall evaluation using dummy-head sound field recording 



Now, tape recordings of orchestral music are made with a Kunstkopf 
(artificial head), see Fig. 10.7, in several “strategic” locations of the hall to be
evaluated. These recordings are then played back over the system illustrated 
in Fig. 10.6. 

With the new reproduction method, instantaneous comparisons of identical
program material have become possible. The author will never forget the
moment he first “switched” himself from a seat in the Berlin Philharmonie to
one in the Vienna Musikvereinssaal listening to a British orchestra playing
Mozart’s Jupiter Symphony. All he believed about the differences between
these two halls based on previous visits (but was none too sure about) sud- 
denly became a matter of easy distinction. 



10.5 Binaural Release from Masking 

The cocktail-party effect is closely related to binaural masking level differ- 
ences, BMLD for short. For example, the audibility of a binaural tone buried 
in noise is improved by as much as 18 dB when the polarity of the tone in one 
ear is reversed. In fact, and paradoxically, the audibility goes up even when 
the tone in one ear is completely absent, see Fig. 10.8. 

Fig. 10.8. Binaural release from masking. Top: presenting the same tone and the 
same sufficiently strong noise to both ears leaves the tone inaudible; the tone is 
masked by the noise. Bottom: paradoxically, removing the tone from one ear makes 
it audible. Numerous other experiments on binaural masking and unmasking (!) 
suggest that the human binaural system can, if it is advantageous in a signal detec- 
tion task, form the difference of the two ear signals (with an error of about 10%). 
This ability results in a binaural improvement of detecting signals in noise of up to 
20 dB. Such binaural masking level differences are also responsible for the “cocktail- 
party effect,” the ability to suppress unwanted sounds (such as the speech babble 
during a noisy cocktail party) and concentrate on the desired voice 

Theories of BMLDs postulate, among other processes, a binaural subtraction
mechanism, mediated perhaps by contralateral neural inhibition. Such a
subtraction circuit in our brains would, of course, act as a noise suppressor 
when identical noises are fed to the two ears [10.10]. 

Figure 10.9 illustrates a similarly paradoxical paradigm in vision. Whereas
the clutter of grey pieces on the left makes no sense, adding the masker (black)
provides enough context to make the grey pattern intelligible. See A. S. Bregman:
Auditory Scene Analysis: The Perceptual Organization of Sound (The MIT
Press, Cambridge, Massachusetts, 1990).




Fig. 10.9. A visual paradox. Adding the masker (black) to the jumble of grey 
pieces on the left can make them intelligible 



10.6 Binaural Beats and Pitch 

The same interaural mechanism seems to be active also in the creation of 
binaural beats when tones of slightly different frequencies are applied to the 
two ears. Binaural subtraction would also explain a curious binaural pitch 
phenomenon, called Huggins pitch, in which two white noises with a relative 
phase shift create a sensation of a whistle-like pitch. The white or broadband 
noise is applied directly to one ear and through an allpass filter to the other 
ear, see Fig. 10.10. The allpass filter produces a phase shift of 360° in a 
narrow frequency region, say 30 Hz centered on 800 Hz. Subtracting the two 
earphone signals produces a narrow band of noise around 800 Hz. Such a 
noise sounds like a noisy whistle with an 800 Hz pitch, much like the binaural 
percept engendered by the arrangement of Fig. 10.10. 

Fig. 10.10. The Huggins pitch, named after its discoverer W. H. Huggins. A broadband
noise is applied to one ear. The same noise, filtered by an allpass filter that
changes the phase by 360° at a frequency $f_0$, is also applied to the other ear. This
creates the sensation of a whistling noise with a pitch corresponding to the frequency
$f_0$. The effect can be explained by assuming that the auditory system can
form the difference of the two ear signals

In the 1960s, while exploring binaural pitch phenomena, I asked myself
whether an intelligible speech signal could be created from two flat-spectrum
(unintelligible) signals whose difference is the running short-time spectrum of
a given speech signal. The required trick is to convert all spectral amplitude
information of the speech signal into phase information and to do it in such
a way that the amplitude information can be recovered by subtracting two
such phase-only signals. This is indeed possible and it makes for an impressive
demonstration. When the two phase-only signals are fed to stereophonic
earphones, each signal by itself is an unintelligible buzz, but listening to the 
earphones simultaneously, one hears intelligible speech. This idea, inciden- 
tally, can be turned into an amusing speech “secrecy” system involving two 
separate transmitting channels in which either channel is completely unin- 
telligible and only the combination of the two unintelligible channels renders 
the result intelligible [10.11]. 



10.7 Direction and Pitch Confused 

The Huggins pitch is an example of a binaural interaction - a sense that is
supposed to tell us something about direction - but that instead produces
a pitch sensation. This “misuse” of human binaural processing is even more
evident in the so-called Fourcin pitch [10.12], another example of a pitch sensation
produced centrally (inside the head). In its simplest form a broadband
noise is applied directly to one ear and with a delay $\tau$ of, say, 5 ms to the
other ear, see Fig. 10.11. This setup creates two sensations, one being, as
expected, a lateral noise heard on the side of the undelayed input and the
other, surprisingly, a noisy pitch sensation with a pitch corresponding to a
frequency of $1/\tau$, i.e. 200 Hz for $\tau = 5$ ms. Again the pitch percept can be
explained by binaural subtraction: the difference between the two ear inputs
in Fig. 10.11 is a combfiltered noise with spectral peaks at $1/(2\tau)$, $3/(2\tau)$, $5/(2\tau)$,
and so on. The spacing between adjacent peaks is $1/\tau$. By a well-studied extension
of the pitch-residue phenomenon to noise-like inputs, the resulting residue
pitch should be $1/\tau$ within appropriate frequency and delay ranges. And this
is indeed what is heard.
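The comb-filter structure of the difference signal is easy to confirm numerically. The sketch below evaluates the magnitude response of "signal minus delayed signal", which is $|1 - e^{-j 2\pi f \tau}|$; the function name is illustrative, and the 5 ms delay is the example value from the text.

```python
import cmath

TAU = 0.005  # delay between the two ear signals in seconds (5 ms, as in the text)


def difference_gain(freq_hz: float, tau_s: float = TAU) -> float:
    """Magnitude response of x(t) - x(t - tau) at frequency freq_hz."""
    return abs(1.0 - cmath.exp(-2j * cmath.pi * freq_hz * tau_s))


# Expected peaks at odd multiples of 1/(2*tau), nulls at multiples of 1/tau
peaks = [(2 * k + 1) / (2 * TAU) for k in range(3)]
nulls = [k / TAU for k in range(3)]

peak_gains = [difference_gain(f) for f in peaks]   # each close to 2
null_gains = [difference_gain(f) for f in nulls]   # each close to 0
spacing = peaks[1] - peaks[0]                      # equals 1/tau

print([round(f) for f in peaks], round(spacing))
```

For a 5 ms delay the peaks fall near 100, 300 and 500 Hz, so adjacent peaks are 200 Hz apart, matching the pitch described above.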

Fig. 10.11. The Fourcin pitch, named after its discoverer A. J. Fourcin. Applying
a broadband noise to the left ear and the same through a delay to the right ear
creates, in addition to an apparent displacement of the noise source to the left ear,
a pitch sensation corresponding to the reciprocal delay. Again the effect can be
explained by assuming that the auditory system can form the difference between
the two ear signals which creates a “combfilter” with periodic peaks spaced $1/\tau$
apart

Even more impressive is the central pitch created by the circuit sketched 
in Fig. 10.12. Here two independent noise generators and two different delays 
are involved. Again, both ears receive broadband noises that separately have 
no pitch attributes. But there is a subjective binaural pitch that corresponds 
to a frequency of $1/|\tau_2 - \tau_1|$.

Fig. 10.12. Two independent broadband noises, feeding the two ears of a listener
via two different delays, create a subjective binaural pitch with a frequency that
corresponds to the reciprocal delay difference $1/|\tau_2 - \tau_1|$. This surprising effect can
be explained as shown in Fig. 10.13



How can we explain this? Does the human auditory processor extract 
both delays separately, form the absolute difference, calculate the reciprocal, 
and then decide to perceive the corresponding pitch? Our brains can wring
many miracles, but they are no pocket calculators. So what is going on? 

Figure 10.13 shows a setup that is completely equivalent to that of Fig. 
10.12, using two new noises: — ni(t) -\-n 2 {t) and n^it) = ni{t) — ri 2 {t). 

The redrawing is based on only one, purely mathematical assumption: when 
independent noises are added, the resulting power spectrum is the sum of the 
individual power spectra. 

As the redrawing, Fig. 10.13, shows, one of the new noises, $\tilde{n}_2(t)$, is
applied to only one ear and is therefore irrelevant for the perceived binaural
pitch height. The other noise, $\tilde{n}_1(t)$, is applied to both ears: unfiltered
to the left ear and combfiltered to the right ear. The combfilter has a peak
spacing of $1/|\tau_2 - \tau_1|$ Hz, corresponding to the perceived pitch. The
explanation of this pitch phenomenon therefore involves something other than
mere binaural subtraction: our auditory processor can apparently detect
binaurally correlated signals and extract them from an uncorrelated background.
(If either noise source in Fig. 10.13 is left out, a monaural combfiltered noise
is created at the right ear with the corresponding monaural pitch perception.)

These paradoxical pitch perceptions are nicely accounted for by Licklider’s 
duplex and triplex theories of hearing in which binaural signals are analyzed 
according to three dimensions: frequency, delay difference, and periodicity, of 
which the latter two are occasionally confused [10.14]. 

The ultimate confusion between localization and pitch is observed in Di- 
ana Deutsch’s octave illusion in which alternating high and low notes are 
applied to the two ears, and “the wrong note is perceived at the wrong time 



Fig. 10.13. If the two noises in Fig. 10.12 are independent, then Fig. 10.12 can
be redrawn as shown here, where $\tilde{n}_1(t)$ and $\tilde{n}_2(t)$ are the sum and
the difference, respectively, of the two noises $n_1(t)$ and $n_2(t)$ in Fig. 10.12.
It is now clear that the delay $\tau_1$ is immaterial and only the delay difference
$\tau_2 - \tau_1$ enters the paradigm, producing combfiltered noises that are known
to elicit a pitch sensation corresponding to a frequency $1/|\tau_2 - \tau_1|$

at the wrong ear,” see Fig. 10.14. The two notes, applied in binaurally alternating
fashion, are in an octave relationship, 400 and 800 Hz, say. For most 
(especially right-handed) listeners the perceived sounds differ from the phys- 
ically present signals in the following respects: the high notes applied to the 
left ear are inaudible; they are only heard at the right ear. The low notes 
are only perceived at the left ear and - paradoxically - at a time when they 
are actually present at the other ear! Switching the two earphones does not 
change the illusion: the earphone that seemed to have been emitting the low 
note is now emitting the high note and vice versa [10.13]! 

The explanation of this stunning illusion is not simple. We have to pos- 
tulate two separate brain mechanisms for pitch and location. Further, we 
have to assume that the perceived pitch is determined by the pitch at the 
right ear (for right-handed listeners) while the tone at the left ear is ignored. 
And finally, we have to suppose that the location of the perceived tone is 
given by the location of the higher tone, regardless of whether the higher 
or lower note is in fact perceived. These assumptions are consistent with 
the neurological evidence that most right-handers process speech in the left 
hemisphere, assuming that these tone pulses are processed by the listeners 
Fig. 10.14. Diana Deutsch’s octave illusion. Two tones, one an octave higher than 
the other, are fed via earphones to the two ears of a listener, as shown at the top. 
What listeners typically hear is shown below it. Paradoxically, they never hear the 
low note when it is actually present at the left ear; they hear the low note only 
when it is present at the right ear - but they hear it at the “wrong” ear. Similarly, 
the high note is never heard when it is actually present at the left ear; instead the 
low note (present at the other ear) is heard in its place. This confusion of location 
and pitch apparently involves higher brain centers, as it depends on handedness of 
the listeners 
as speech-like signals and not as complex musical compositions (for which a 
left-ear advantage is usually found). 



10.8 Pseudo-Stereophony 

Having two ears, we naturally prefer stereophonic sounds over monophonic 
sounds - a fact amply exploited by the high-fidelity industry. Curiously, 
stereophonic perceptions can be produced from single-channel audio sig- 
nals by appropriate manipulations. An early example of such “pseudo- 
stereophony” is the Lauridsen effect [10.15]. 

In Lauridsen’s original setup a single (monophonic) sound signal is radi- 
ated from a loudspeaker facing the listener, see Fig. 10.15. In addition, the 
same signal is radiated from a second loudspeaker, more distant from the 
listener, facing sideways and open at the back. Thus, the phase of the sig- 
nal from this second loudspeaker differs by 180° at the listener’s two ears. 
Together with the (in-phase) signal from the closer loudspeaker and, after 





Fig. 10.15. Lauridsen’s stereophonic effect obtained from a single (monophonic) 
signal. The two loudspeakers, if properly equalized, produce interleaving comb-like 
frequency responses at the two ears of a listener facing the loudspeakers. Thus 
roughly half the frequency components of a music signal are perceived at the right 
ear and the other half at the left ear. Our auditory system, not knowing what 
to make of such a confusing situation, apparently gives up on ordinary localiza- 
tion and seems to be telling the brain “everything comes from everywhere” - an 
overwhelming, albeit artificial, stereophonic effect 
appropriate equalization, two complementary comb-like frequency responses 
are generated at the two ears. In other words, half the spectrum of the mono- 
phonic signal goes to one ear while the other half goes to the other ear. The 
subjective result is a strong “stereophonic” feeling of being totally immersed 
in the sound. The relative delay between the two loudspeakers should exceed 
10 ms for best results. 

If this interpretation of the Lauridsen effect is correct, then a bank of 
contiguous bandpass filters, alternate filter outputs applied to the two ears, 
should also give a pseudo-stereophonic effect. This is indeed the case as 
demonstrated in the late 1950s with an available vocoder filterbank [10.16], 
see Fig. 10.16. 
Fig. 10.16. A (successful) attempt to duplicate Lauridsen’s stereophonic effect,
see Fig. 10.15, with a vocoder filterbank. With a monophonic input (left) a strong
spatial sensation is created for a listener wearing the earphones (right)



Pseudo-stereophony can even be attained without spectral distortion, 
namely by means of allpass filters [10.17]. Figure 10.17 shows the impulse 
response of an allpass filter originally suggested for artificial reverberation. 
By inverting the signal of every other impulse, another allpass filter with a dif- 
ferent phase response is created. The group delay difference between the two 
allpass filters is shown in Fig. 10.18. In certain frequency ranges the group de- 
lay of one filter exceeds that of the other filter. For the remaining, interleaved, 
frequency ranges the delays are reversed. Thus, if a monophonic signal is fed 
to the two filters and their outputs connected to binaural earphones or stereo- 







[Figure: impulse response h(t) plotted against time]



Fig. 10.17. Impulse response of an allpass filter: its frequency response is flat 
but the filter introduces phase changes. For sufficiently long delays, reverberation 
without frequency distortion is produced 




Fig. 10.18. A monophonic audio signal is filtered by the allpass filter shown in 
Fig. 10.17 and applied to one ear. For the other ear the signal is filtered by a 
complementary allpass filter (in which the signs of the pulses in its impulse response 
alternate). The resulting percept is strongly stereophonic. This figure shows the 
group delay difference between the two ear channels: some frequencies are delayed 
at one ear, others are delayed at the other ear, thereby creating the observed effect 



phonic loudspeakers, the listener perceives spatially dispersed sound without 
the spectral distortion inherent in Lauridsen’s setup or combfilter [10.17]. 
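
The interleaving of group delays can be illustrated with a small calculation. This sketch assumes that the filter of Fig. 10.17 has the standard reverberator allpass form $H(z) = (-g + z^{-1})/(1 - g z^{-1})$, with $z^{-1} = \exp(-i\omega\tau)$ for a loop delay $\tau$. The gain $g = 0.7$ and the function names are illustrative assumptions. For this form, inverting the sign of every other pulse of the impulse response is equivalent, apart from an overall sign, to replacing $g$ by $-g$.

```python
import cmath
import math

def allpass_response(omega_tau, g):
    """Frequency response of (-g + z^-1) / (1 - g z^-1), z^-1 = exp(-i omega tau)."""
    z_inv = cmath.exp(-1j * omega_tau)
    return (-g + z_inv) / (1.0 - g * z_inv)

def group_delay(omega_tau, g, step=1e-5):
    """Group delay in units of tau: minus the phase slope, by central difference."""
    ratio = allpass_response(omega_tau + step, g) / allpass_response(omega_tau - step, g)
    return -cmath.phase(ratio) / (2.0 * step)

g = 0.7  # assumed loop gain (illustrative)

# Sample one period of omega * tau, from 0 to 2 pi.
for i in range(9):
    x = 2.0 * math.pi * i / 8
    magnitude = abs(allpass_response(x, g))               # stays at 1: flat response
    difference = group_delay(x, g) - group_delay(x, -g)   # changes sign with frequency
    print(f"omega*tau = {x:5.2f}  |H| = {magnitude:.6f}  delay difference = {difference:+.3f}")
```

Both filters keep a flat magnitude response, while the sign of the delay difference alternates between neighboring frequency ranges, which is the behavior described for Fig. 10.18.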



10.9 Virtual Sound Images 

In the mid-1960s, B.S. Atal and the author had the idea that sound could 
be made to appear as arriving from any desired direction by appropriately 
filtering a single-channel audio signal before radiating it from just two loud- 
speakers positioned in front of the listener [10.18]. And this is indeed possible. 
Fig. 10.19. The creation of virtual sound images by linear filtering of a monophonic 
signal. The filter characteristics depend on sound diffraction around the listener’s 
head 

See Fig. 10.19. The effect of a lateral echo was so convincing that many
listeners turned their heads to look for the (absent) lateral sound source. Of 
course, when they did so, the effect disappeared because the two filters are 
designed for a given head orientation. (In fact, even the head shape has some 
influence. For once, it’s not what’s inside the human head that determines 
the outcome of the experiment but its external geometry.) 

By means of such and similar filters John Chowning and others have 
created sounds that seem to swirl around in three-dimensional space, thus 
giving music a third, controlled dimension: pitch, rhythm, and space [10.19]. 
It is even possible to simulate digitally sound transmission in a full-blown 
concert hall - either existing or in the planning stage, thereby reducing the 
risk of expensive design errors. 

Good acoustic quality of a concert hall requires strong laterally traveling 
soundwaves. This was pointed out by Michael Barron, Harold Marshall [10.20] 
and others. And it was also the main result of an investigation of numerous 
(mostly European) concert halls carried out at Göttingen by P. Damaske, V. 
Mellert, D. Gottlob, S. Mehrgardt, U. Eysholdt, and K.F. Siebrasse [10.21]. 



10.10 Philharmonic Hall, New York 

One of the parameters thought to be important for good acoustical quality of
a concert hall is the initial time gap, i.e. the delay, at a listener’s ears,
between the arrival of the direct sound and that of the first reflection from the
ceiling or side walls. To keep the value of this time gap below its upper limit
for good quality (about 25 ms), acoustic reflection panels were installed over
the audience area in Philharmonic Hall at Lincoln Center for the Performing
Arts in New York City, see Fig. 10.20. 

However, in spite of this precaution (or, more likely, because of the acous- 
tic panels) the acoustics of Philharmonic Hall seemed to suffer. One of the 
Fig. 10.20. Philharmonic Hall (now Avery Fisher Hall) at the Lincoln Center
for the Performing Arts in New York City (before any acoustic alterations). The
overhead reflecting panels were found too small to properly reflect low frequencies
(especially from the celli and double basses), leading to a lack of “warmth.” They
also emphasized the overhead (“monophonic”) component of the sound arriving at
a listener’s ear as opposed to the lateral (“stereophonic”) sound. This led to a feeling
of detachment for the listeners, a lack of being enveloped by music 



persistent complaints was a sense of detachment from the music and a feel- 
ing of not being enveloped by the sound. The culprits, it turned out, were 
indeed the reflecting panels, which increased the sound energy arriving in the 
symmetry plane of the listeners’ heads thus emphasizing the “monophonic” 
effect of the hall. 



10.11 The Proper Reproduction of Spatial Sound Fields 

To get at the source of the acoustic quality problems, a group of physicists at
the University of Göttingen, with the support of the German Science
Foundation (DFG), investigated 22 halls in Europe and beyond. To ensure
identical music signals for the subjective comparisons, reverberation-free music
(recorded for this purpose by the London Chamber Orchestra in an anechoic
environment) was radiated from the stages of the halls under investigation
and recorded by means of a specially designed “Kunstkopf,” see Fig. 10.7
center. 

The recorded signals were radiated from two loudspeakers fed through
a “crosstalk compensation” filter system which transfers the left (right) ear
signal exclusively to the left (right) ear with negligible crosstalk for listeners
whose head shapes are within a certain range of the standard used for
calibrating the crosstalk filters, see Fig. 10.6.

Fig. 10.21. Acoustic preference space for four different concert halls (designated
by the letters E, P, Q, T) and a total of 10 different listening locations in these
halls (E1 through T3). A recording of classical music (Mozart’s Jupiter Symphony)
was played from the stages of these halls and recorded with a dummy head. The
recordings were reproduced by a special sound system in an anechoic space and
evaluated by experienced listeners in paired comparison preference tests. A
three-dimensional “preference space” was constructed from the subjective judgements
by multi-dimensional scaling. The first two dimensions are shown in this figure.
The 10 different listeners are represented here by 10 vectors pointing in different
directions. The different halls and seats (E1 through T3) are arranged in this
space in such a manner that their normal projections on the different listeners’
vectors reproduce the preference scores with minimum error. (The two dimensions
shown here account for more than 80% of the total variance.) Because all listeners’
vectors (except listener 4) point into the right half plane, the abscissa can be
labelled “consensus preference.” The ordinate reflects “individual differences” in
musical tastes of these listeners

The results of thousands of paired comparisons were analyzed by multi- 
dimensional scaling. The two main dimensions of the resulting three-dimen- 
sional “preference space” are shown in Fig. 10.21. 

The ten arrows are unit vectors representing ten different listeners. Each 
letter/number pair refers to a specific hall/seat location. Projecting the 
hall/seat points normally (at right angles) onto a listener’s vector repro- 
duces that listener’s preference scores (within the 85% of the total variance 
accounted for by the first two dimensions of the solution). 
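
The projection rule can be made concrete with a toy calculation. The seat coordinates and the listener direction below are invented for illustration and are not read from Fig. 10.21; only the rule itself (score equals the normal projection of a hall/seat point onto a listener's unit vector) is taken from the text.

```python
import math

# Hypothetical coordinates in the two-dimensional preference plane (not measured data).
seats = {"E1": (0.8, 0.3), "P2": (-0.5, 0.4), "T3": (0.1, -0.6)}

# Hypothetical listener: unit vector pointing 30 degrees above the horizontal axis.
angle = math.radians(30.0)
listener = (math.cos(angle), math.sin(angle))

# Normal projection onto the listener vector = scalar product with the unit vector.
scores = {name: x * listener[0] + y * listener[1] for name, (x, y) in seats.items()}

# Seats ordered from most to least preferred by this listener.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Moving a seat point to the right raises its score for every listener whose vector has a positive horizontal component, which is why that axis is read as a consensus dimension.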
Fig. 10.22. Correlation of the subjective preference dimensions, $D_1$ and $D_2$, with
two acoustical parameters (reverberation time and interaural similarity) and one
architectural measure (width). Whereas reverberation time is, as expected,
positively correlated with the consensus preference dimension ($D_1$), interaural
similarity shows a strong negative correlation. This means that people prefer
stereophonic sound over monophonic sound. The negative correlation of the halls’
widths can be explained by the weakness of lateral sounds that such hall shapes
engender, leading to a more monophonic sound 



Except for listener 4, all listeners’ vectors point into the right half-plane. 
Thus, the horizontal axis can be called “consensus preference” because if (by 
some architectural modification, for example) a hall/seat point is moved to 
the right, all listeners (except listener 4) would prefer the new condition. The 
fact that some listener vectors point up, while others point down, reflects their 
individual differences in musical taste. The vertical dimension has therefore 
been called “individual difference.” 

To extract further useful information from this data, the two main subjective
preference coordinates, $D_1$ and $D_2$, have been correlated with various 
objective (acoustic and architectural) parameters such as reverberation time 
measured at the corresponding hall/seat, interaural similarity measured at 
the “dummy’s” ears, and the (average) width of the hall, see Fig. 10.22. 
As expected, reverberation time showed a strong positive correlation with 
consensus preference, independent of musical taste. 



10.12 The Importance of Lateral Sound 

Another strong, albeit negative, correlation is found between interaural sim- 
ilarity and consensus preference. This means that ear signals that are too 
similar (“monophonic”) are bad for good acoustics. The negative correlation 
for the widths of the hall is a result of this “stereophonic” preference because 
wide halls produce weak lateral sound. These findings may in fact explain why 
so many modern halls are acoustically disliked: larger audiences and “wider” 
people force the construction of wider halls. In order to lower building costs, 
modern ceilings are much lower than in the old-style “shoe-box” halls such 
as the Vienna Musikvereinssaal. (The air for breathing in a modern hall is 
typically supplied by an air-conditioning system.) These two trends - wide 
halls and low ceilings - conspire to diminish the energy of laterally traveling 
sounds relative to the sound energy arriving in the symmetry plane through 
the listeners’ heads, leading to the deleterious “monophonic” sound in many 
a modern hall [10.22]. 

The importance of lateral sound for good acoustic quality may be further 
confirmed by a method called digital modification. Instead of tearing a hall 
Fig. 10.23. A pair of binaural impulse responses showing sound directly transmitted
from the stage to the listener’s ears (left) and an original reflection from the
ceiling arriving simultaneously at the two ears. Also shown are two artificial
reflections added by computer, one from the left (arriving at the left ear first) and
one from the right (arriving at the right ear first). Listening tests with such
digitally modified impulse responses, when convolved with music signals,
demonstrated the possible improvements in acoustic quality of concert halls (without
tearing down any walls and rebuilding)

Fig. 10.24. The author evaluating virtual, digitally created sound fields in the
anechoic chamber (“free-space room”) of Bell Laboratories at Murray Hill, New
Jersey 



down and building a new one, lateral echoes can be added to the recorded 
impulse response of a hall/seat on the computer, see Figs. 10.23 and 10.24. 
In subjective tests of halls with such digitally enhanced lateral sounds the 
preference always increased. 



10.13 How to Increase Lateral Sounds in Real Halls 

Given that building costs and audience size (in both senses of the word “size”)
forbid reverting to the old style narrow-and-high halls, what can one do to 
enhance laterally traveling sounds? One possible answer is the introduction of 
“corrugated” surfaces (on the ceiling and elsewhere) that scatter an incoming 
wave into a broad pattern of reflected wavelets, see Fig. 10.25. Sound-diffusing 
surfaces are also important for recording studios, including those for speech 
signals, to diminish standing waves between opposite walls that interfere with 
musical sound quality and make accurate speech signal analysis more difficult. 

For the scattering of a single wavelength, $\lambda$, the optimal depth of the
corrugations is $\lambda/4$, corresponding to a phase shift of the reflected wave by
180° or half a wavelength. The spatial sampling theorem dictates that the
lateral spacing (“grating constant”) should not exceed $\lambda/2$ for effective
scattering by ±90°. But where should the surface be up and where down? The
answer comes from a branch of number theory called finite fields or Galois
fields, abbreviated GF($p^m$), where $p$ is a prime number and $m$ is a natural 
Fig. 10.25. Sound-diffusing surface for a single wavelength, based on a number-
theoretic formula (primitive polynomials in finite fields over the prime number 2) 



number [10.23]. For $p = 2$ and $m = 3$, for example, resulting in a periodic
scattering surface with a period length of $p^m - 1 = 2^3 - 1 = 7$, one has to
factor the polynomial $x^7 + 1$ into irreducible factors over GF(2):

$$x^7 + 1 = (x + 1)(x^3 + x^2 + 1)(x^3 + x + 1) .$$

(The reader who intends to check the factorization should bear in mind that
$1 + 1 = 0$ for $p = 2$.) From these factors one has to select one that does not
occur as a factor in $x^n + 1$ for $n < 7$. Such a factor, $x^3 + x^2 + 1$ or $x^3 + x + 1$,
is called a primitive polynomial. By setting one of these, say $x^3 + x + 1$, equal
to zero, one obtains the equation

$$x^3 = x + 1 ,$$

which is then considered as a generating function for a recursion relation
yielding

$$a_{n+3} = a_{n+1} + a_n .$$

With the initial values $a_1 = a_2 = a_3 = 1$, for example, a periodic sequence
with period length 7 is generated:

$$a_n = 1, 1, 1, 0, 0, 1, 0;\ 1, 1, 1, \ldots .$$

The $a_n$ are converted to reflection coefficients by the formula

$$r_n = \exp(i 2\pi a_n / p)$$

yielding, with the above $a_n$ and $p = 2$,

$$r_n = -1, -1, -1, 1, 1, -1, 1;\ -1, -1, -1, \ldots .$$
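
The recursion and the conversion to reflection coefficients are easy to reproduce. The sketch below uses the recursion $a_{n+3} = a_{n+1} + a_n$ (modulo 2) with the initial values given above. It also evaluates the discrete Fourier transform of one period of $r_n$, on the assumption (standard for periodic gratings, but not stated in the text) that its magnitudes indicate the relative amplitudes of the scattered waves. The function names are illustrative.

```python
import cmath

def galois_sequence_gf2(initial=(1, 1, 1), length=7):
    """Generate a_n from a_{n+3} = a_{n+1} + a_n (mod 2), i.e. from x^3 = x + 1."""
    a = list(initial)
    while len(a) < length:
        # a[-3] is a_n and a[-2] is a_{n+1} relative to the next element a_{n+3}.
        a.append((a[-2] + a[-3]) % 2)
    return a[:length]

def dft_power(r):
    """Squared magnitudes of the discrete Fourier transform of one period of r."""
    n = len(r)
    return [abs(sum(r[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))) ** 2
            for k in range(n)]

a = galois_sequence_gf2()
# r_n = exp(i 2 pi a_n / p) with p = 2: a_n = 0 gives +1, a_n = 1 gives -1.
r = [round(cmath.exp(2j * cmath.pi * a_n / 2).real) for a_n in a]
power = dft_power(r)

print("a_n:", a)
print("r_n:", r)
print("power per spatial frequency:", [round(p, 6) for p in power])
```

All non-zero spatial frequencies carry the same power, while the zero-frequency term is weaker, consistent with a surface that spreads the reflected energy over several directions instead of concentrating it in one.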
Fig. 10.26. Diffraction pattern from the surface shown in Fig. 10.25. A single incident
plane wave is scattered into 7 different directions with nearly equal amplitudes 



The two different reflection coefficients ($1$ and $-1$) are realized by troughs
of depth $\lambda/4$, as shown in Fig. 10.25. Figure 10.26 shows the angular scatter
from such a surface, made out of sheet metal, irradiated by a 3-cm microwave.
(There is no essential difference between the scattering of properly polarized
microwaves and that of sound waves from such surfaces.) As expected, the incident
wave is scattered into many different directions covering the entire space. 

However, the “one-step” corrugation shown in Fig. 10.25 works only for
“one” frequency (actually about one octave). To cover more octaves, as
needed for speech and music applications, one has to use more than one
step-size in the diffusor. There are two solutions: one is based again on
primitive polynomials over GF($p^m$) with $p$ a prime larger than 2. For example,
for $p = 11$ and $m = 1$ one (of the four) primitive polynomials over GF($11^1$)
is $x + 9$. Setting $x + 9$ equal to 0 gives $x = -9$ or, modulo 11, $x = 2$, called a
primitive root. Thus the recursion is $a_{n+1} = 2a_n$. Starting with $a_1 = 2$, the
periodic Galois sequence is

$$a_n = 2, 4, 8, 5, 10, 9, 7, 3, 6, 1;\ 2, \ldots .$$

It has the proper period length $p^m - 1 = 11^1 - 1 = 10$. It also has 10 different
values (from 1 to 10), so that a frequency range of about 1 to 21 (10.5/0.5 = 21),
i.e. more than four octaves, is covered. 
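
The primitive-root sequence can be generated in a few lines. This sketch follows the recursion $a_{n+1} = 2a_n$ modulo 11 and also lists all primitive roots of 11 by brute force, which corresponds to the four primitive polynomials mentioned above. The helper names are illustrative.

```python
def galois_sequence_prime(p, root):
    """One period of a_{n+1} = root * a_n (mod p), starting with a_1 = root."""
    a = root
    sequence = []
    for _ in range(p - 1):
        sequence.append(a)
        a = (root * a) % p
    return sequence

def is_primitive_root(p, g):
    """True if the powers of g run through every nonzero residue modulo the prime p."""
    return sorted(galois_sequence_prime(p, g)) == list(range(1, p))

sequence = galois_sequence_prime(11, 2)
primitive_roots = [g for g in range(1, 11) if is_primitive_root(11, g)]

print("Galois sequence for p = 11, root 2:", sequence)
print("primitive roots of 11:", primitive_roots)
```

Each primitive root $g$ corresponds to a polynomial $x - g$, which modulo 11 reads $x + (11 - g)$; the root 2 gives the polynomial $x + 9$ used in the text.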

Another solution to the scattering problem for broad frequency ranges is
based on quadratic residues, another number-theoretic concept [10.23]. The
quadratic residues modulo the prime $p = 7$, for example, are given by the
periodic sequence with period length 7,

$$a_n = 0, 1, 4, 9 \equiv 2, 16 \equiv 2, 25 \equiv 4, 36 \equiv 1;\ 0, 1, 4, \ldots .$$

Here $\equiv$ stands for “congruent to,” i.e. the remainder after dividing by 7. To
convert these numbers to local reflection coefficients, $r_n$, one uses again the
equation $r_n = \exp(i 2\pi a_n / p)$. The required phase shifts are once more realized
by corrugations or troughs of different depths. Figure 10.27 shows a surface 
Fig. 10.27. Reflection phase grating for scattering sound waves containing many 
musical octaves, based on quadratic residues modulo the prime number 17 



based on the prime number p = 17, effective for a frequency range of 0.5 to 
16.5 exceeding five musical octaves. 
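
The quadratic-residue construction can be checked in the same way. The sketch below computes $a_n = n^2$ modulo $p$ and the reflection coefficients $r_n = \exp(i 2\pi a_n / p)$, then evaluates the discrete Fourier transform of one period. Reading the transform magnitudes as relative scattering amplitudes is an assumption borrowed from grating theory, and the function names are illustrative.

```python
import cmath
import math

def quadratic_residue_sequence(p):
    """One period of a_n = n^2 mod p."""
    return [(n * n) % p for n in range(p)]

def reflection_coefficients(a, p):
    """r_n = exp(i 2 pi a_n / p)."""
    return [cmath.exp(2j * math.pi * a_n / p) for a_n in a]

def dft_magnitudes(r):
    """Magnitudes of the discrete Fourier transform of one period of r_n."""
    n = len(r)
    return [abs(sum(r[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n)))
            for k in range(n)]

a7 = quadratic_residue_sequence(7)
magnitudes17 = dft_magnitudes(reflection_coefficients(quadratic_residue_sequence(17), 17))

print("quadratic residues mod 7:", a7)
print("DFT magnitudes for p = 17:", [round(m, 6) for m in magnitudes17])
```

For an odd prime $p$ every magnitude equals $\sqrt{p}$, so no single direction, including the specular one, dominates.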

Figure 10.28 shows the scattering of an incident plane wave from this sur- 
face. Again, the reflected energy is distributed over the entire space. There is 
no single strong reflection, only many weak ones. No wonder these quadratic- 
residue diffusors “make a wall disappear” as one apt description has it. 

For even more effective scattering, two-dimensional reflection phase grat- 
ings and even fractal structures have been successfully employed [10.24,10.25]. 




[Figure labels: incident wave, specular reflection]

Fig. 10.28. Angular reflection diagram showing the broad sound scattering of a
single incident plane wave 



10.14 Summary 

The spatial attributes of acoustic fields are an important factor for good 
sound reproduction, including good acoustics for lecture halls, rooms for tele- 
conferences and concert halls. Beyond this, the three-dimensional aspects of 
sound perception offer the modern composer a new dimension to exploit 
creatively. 

Binaural listening is a delightful subject with many surprises and useful 
applications as well as some perplexing paradoxes that instruct us about the 
workings of our brains. 




11. Basic Signal Concepts 



In this chapter we introduce the basic concepts that govern signal analy- 
sis for both continuous and discrete signals, including Fourier and Hilbert 
transforms, correlation functions, and the cepstrum. 

The sampling theorems for lowpass and bandpass signals play a central 
role, allowing the description of bandlimited signals in terms of discrete sam- 
ples. Since most of our discussions are in terms of discrete signals, we pay 
special attention to the z-transform, its properties and applications. 

Many methods of signal analysis transform the input from one domain 
into another, typically from the time domain to the frequency domain or vice 
versa. These transformations are accomplished by one form or another of 
Fourier’s famous transformation: real (sine and cosine) transforms, complex 
transforms, and the fast Fourier transform (FFT). For finite blocks of sam- 
ples, the discrete Fourier transform (DFT) replaces the customary Fourier 
integral and Fourier sum. The importance of the DFT is further highlighted 
by the fact that digital computers and digital signal processors inevitably 
deal with finite sets of discrete data. 

A considerable broadening of our view of signals is brought about by 
the introduction of the Hilbert transform of a signal and the closely related 
analytic signal. These useful concepts lead in turn to the definition of the 
Hilbert envelope and phase, and the instantaneous frequency. 



11.1 The Sampling Theorem 
and Some Notational Conventions 

In speech signal analysis we are dealing with functions of time that represent, 
typically, sound pressure in air or a voltage on an electric conductor. Generic 
symbols for representing such functions are lower case letters like s and x. 
Continuous time is represented by the letter t. Thus s(t) is our generic speech 
signal. 

Before we can feed a signal into a digital computer or a digital signal
processor (DSP), which accept only discrete, finite-precision numbers, we have
to bandlimit the signal so that it can be represented by a discrete sequence,
which we write $s[n]$. Here $n$ represents discrete time, $t_n = t_0 + nT$, where
$T$ is the sampling interval. Nyquist’s sampling theorem [11.1] tells us that $T$
should be smaller than $1/(2B)$, where $B$ is the (one-sided) frequency
bandwidth of the signal:

$$T < \frac{1}{2B} . \tag{11.1}$$



Note that a sampling interval $T = 1/(2B)$ is already in violation of the sampling
theorem. The reason behind the inequality (11.1) of the sampling theorem is
that sampling a signal corresponds to multiplying it by a periodic spike train.
This multiplication superimposes multiple replicas of the signal’s Fourier
transform on itself with a frequency spacing $\Delta f$ equal to $1/T$. To avoid overlap
between two adjacent replicas, $\Delta f$ has to be larger than the total (two-sided)
bandwidth $2B$ of the signal (including both positive and negative frequencies);
thus $\Delta f > 2B$. Because $T = 1/\Delta f$, the sampling theorem for signals (11.1)
follows immediately. 
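
A simple case shows why equality in (11.1) is not enough. The sketch below samples a sine wave whose frequency sits exactly at the band edge $B$. The values $B = 100$ Hz and the two sampling intervals are illustrative assumptions. With $T = 1/(2B)$ every sample is $\sin(\pi n)$, so the tone disappears from the sampled data.

```python
import math

B = 100.0  # assumed band edge in Hz (illustrative); the test tone sits exactly at B

def sample_tone(T, count=50):
    """Samples of sin(2 pi B t) taken at t = n T."""
    return [math.sin(2.0 * math.pi * B * n * T) for n in range(count)]

critical = sample_tone(1.0 / (2.0 * B))   # T = 1/(2B): violates the strict inequality
faster = sample_tone(1.0 / (2.5 * B))     # T < 1/(2B): satisfies (11.1)

print("largest sample magnitude at T = 1/(2B):", max(abs(v) for v in critical))
print("largest sample magnitude at T = 1/(2.5B):", max(abs(v) for v in faster))
```

At the critical interval the samples vanish up to rounding error, so amplitude information about the tone is lost; the slightly shorter interval retains it.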

To go from the samples $s[n]$ to the continuous function $s(t)$, we need an
interpolation formula [11.2]:

$$s(t) = \sum_{k=-\infty}^{\infty} s[kT]\, \mathrm{sinc}\left(\frac{1}{T}(t - kT)\right) .$$

Here the function sinc($x$) is the so-called sinc function:

$$\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x} .$$

It is the Fourier transform of a sharply bandlimited pulse. The sinc function
was first introduced in wave optics to describe the diffraction of coherent light
at a sharp slit in an opaque screen. (Hence its German name Spaltfunktion,
from Spalt meaning slit.) 
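
The interpolation formula can be tried on a simple bandlimited test signal. In this sketch the signal, the sampling interval $T = 0.04$ s, and the truncation of the infinite sum to 1001 samples are all illustrative assumptions. Because the sum is truncated, the reconstruction is only approximate between the sampling instants.

```python
import math

def sinc(x):
    """sinc(x) = sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def signal(t):
    """Test signal with components at 3 Hz and 7 Hz, i.e. bandlimited below B = 10 Hz."""
    return math.cos(2.0 * math.pi * 3.0 * t) + 0.5 * math.sin(2.0 * math.pi * 7.0 * t)

T = 0.04   # sampling interval in seconds; smaller than 1/(2B) = 0.05 s
K = 500    # keep samples k = -K ... K (a truncation of the infinite sum)
samples = {k: signal(k * T) for k in range(-K, K + 1)}

def interpolate(t):
    """Truncated version of s(t) = sum_k s[kT] sinc((t - kT) / T)."""
    return sum(s_k * sinc((t - k * T) / T) for k, s_k in samples.items())

for t in (0.013, 0.5, -0.777):
    print(f"t = {t:+.3f}  original = {signal(t):+.5f}  interpolated = {interpolate(t):+.5f}")
```

At the sampling instants themselves only one term of the sum is non-zero, so the samples are returned exactly; elsewhere the small residual error stems from the truncation.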

Note that in going from the continuous signal s(t) to the sampled version 
s[n], we have kept the same letter, s, for the function but indicated a change 
of variable by switching from parentheses ( ) to brackets [ ]. This seems a 
good notational convention. 



11.2 Fourier Transforms 



No mathematical transformation is as basic to linear signal analysis as the
Fourier transform. For a continuous, square-integrable¹ signal $s(t)$, we define
the Fourier transform $\hat{s}(\omega)$ as follows:

$$\hat{s}(\omega) = \int_{-\infty}^{\infty} s(t) \exp(-i\omega t)\, dt . \tag{11.2}$$



¹ A function is called square-integrable if it has “finite energy,” that is, the integral
from $-\infty$ to $+\infty$ over its absolute square exists. A sine wave of infinite duration or
a constant-power noise is not square-integrable unless properly “windowed.” 
The Fourier variable $\omega$ is called angular frequency or radian frequency; it
equals $2\pi$ times the frequency, which is usually measured in Hertz,
abbreviated Hz. 

In (11.2) we have again kept the same letter, $s$, embellished by a “hat,” or
circumflex, $\hat{s}$, to emphasize the close connection and the one-to-one relation
between a signal $s$ and its Fourier transform $\hat{s}$. 

The absolute square of $\hat{s}(\omega)$ is called the energy spectrum of $s(t)$ because
it tells us how the energy of $s(t)$ is distributed over the different angular
frequencies. Note that the energy spectrum is shift-invariant, that is, it does
not change with phase or time shifts - an important “symmetry” of the
Fourier transform [11.3]. 
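
Shift invariance is easy to verify for sampled data. This sketch uses the discrete Fourier transform of a short block of samples as a stand-in for the integral (11.2), and a circular shift as a stand-in for a time delay; both substitutions, and the sample values themselves, are illustrative assumptions.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a finite block of samples."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

samples = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0]   # arbitrary test samples
shifted = samples[-3:] + samples[:-3]                  # circular delay by 3 samples

energy = [abs(c) ** 2 for c in dft(samples)]
energy_shifted = [abs(c) ** 2 for c in dft(shifted)]

print([round(e, 6) for e in energy])
print([round(e, 6) for e in energy_shifted])
```

The delay changes only the phases of the transform values, so the two printed energy spectra agree.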

Another reason for the importance of the Fourier transform is that it
converts differential and integral equations into algebraic equations, which are
usually easier to solve. In fact, differentiation simply corresponds to
multiplying the Fourier transform by $i\omega$, and integration (if legitimate)
corresponds to a multiplication by $(i\omega)^{-1}$. 
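
The differentiation rule can be demonstrated on sampled data. The sketch below assumes a test signal that is periodic over the unit interval, so that a discrete Fourier transform of 32 samples represents it exactly; the signal and all names are illustrative. The transform is multiplied by $i\omega$ and transformed back, and the result is compared with the analytically known derivative.

```python
import cmath
import math

N = 32
t = [n / N for n in range(N)]  # one period of length 1

s = [math.cos(2 * math.pi * 3 * x) + 0.5 * math.sin(2 * math.pi * 5 * x) for x in t]
ds_exact = [-6 * math.pi * math.sin(2 * math.pi * 3 * x)
            + 5 * math.pi * math.cos(2 * math.pi * 5 * x) for x in t]

def dft(x, sign):
    """Discrete Fourier sum with exponent sign -1 (forward) or +1 (inverse, unscaled)."""
    n = len(x)
    return [sum(x[idx] * cmath.exp(sign * 2j * cmath.pi * idx * k / n) for idx in range(n))
            for k in range(n)]

spectrum = dft(s, -1)
# Angular frequency of bin k; the upper half of the bins represents negative frequencies.
omega = [2 * math.pi * (k if k < N // 2 else k - N) for k in range(N)]
d_spectrum = [1j * w * c for w, c in zip(omega, spectrum)]
ds_fourier = [c.real / N for c in dft(d_spectrum, +1)]

max_error = max(abs(a - b) for a, b in zip(ds_fourier, ds_exact))
print("largest deviation from the exact derivative:", max_error)
```

The deviation is at the level of rounding error, illustrating how a derivative in time becomes a plain multiplication in the frequency domain.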

Jean Baptiste Joseph Fourier (1768-1830), who solved the differential
equation for heat conduction with the mathematical transformation named
after him, encountered stubborn opposition from many of his fellow
mathematicians who wouldn’t believe his claim that any function could be
represented by a Fourier series. This debate has not entirely abated. While
sufficient conditions for the existence of the Fourier transform are known, the jury
on necessary conditions is still out (200 years after the original “hearing”). 
However, the (occasionally acrimonious) altercation has spawned much good 
mathematics and led to a considerable broadening of the function concept. 
Just think of such “insane” creations as the continuous but nowhere differen- 
tiable functions concocted by Karl Weierstrass (1815-1897), an early example 
of what is now called a fractal [11.5]. 

As a result of this work on the foundations of analysis, we now have the- 
ories of integration free of contradictions. And various sufficient conditions 
for Fourier integrals to exist are known. For example, the signal may have 
discontinuities (even infinitely many discontinuities if they have no accumu- 
lation points). In such cases, the inverse Fourier transform will give a value 
that is the average of the two values just to the left and the right of the 
discontinuity. 

If desired, the Fourier transform can be decomposed into its real and
imaginary parts. For real signals, we have

$$\mathrm{Re}\,\tilde{s}(\omega) = \int_{-\infty}^{\infty} s(t)\cos(\omega t)\,dt \qquad (11.3)$$

and

$$\mathrm{Im}\,\tilde{s}(\omega) = -\int_{-\infty}^{\infty} s(t)\sin(\omega t)\,dt . \qquad (11.4)$$

Note that for symmetric real signals, $s(t) = s(-t)$, the imaginary part of the
Fourier transform vanishes (because $\sin(\omega t)$ is antisymmetric).

The inverse Fourier transform gives $s(t)$ in terms of $\tilde{s}(\omega)$:

$$s(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{s}(\omega)\exp(i\omega t)\,d\omega . \qquad (11.5)$$

Note the factor $1/2\pi$ and the “missing” minus sign in (11.5) compared to
(11.2). For periodic signals $s(t) = s(t + kT)$ with fundamental angular
frequency $\omega_0 = 2\pi/T$, the integral is extended over a single period
$T = 2\pi/\omega_0$ of the signal, resulting in a Fourier transform with discrete
Fourier components (spectral “lines”):

$$\tilde{s}[m] = \frac{1}{T}\int_{t_0}^{t_0+T} s(t)\exp(-i m\omega_0 t)\,dt , \qquad (11.6)$$

where the integer $m$ is called harmonic number. Here again, we have expressed
the change of variable (from continuous angular frequency $\omega$ to discrete
harmonic number $m$) by switching from parentheses to brackets while keeping
the same symbol, $\tilde{s}$, for the function.

The absolute square of $\tilde{s}[m]$ is called a line spectrum, as opposed to a
continuous spectrum. The line spectrum, a term borrowed from optical
spectroscopy, shows how the signal power is distributed over the different
harmonics.

The corresponding inverse Fourier transform is given by a sum over
complex exponentials:

$$s(t) = \sum_{m=-\infty}^{\infty} \tilde{s}[m]\exp(i m\omega_0 t) . \qquad (11.7)$$

This is the famous Fourier series which describes a periodic signal (think of
a note played on a violin) in terms of its fundamental frequency component
and its harmonics. (Here the Fourier series is given in the complex notation
where each physical frequency component is represented by two terms, one
positive, $m > 0$, and one negative, $m < 0$.)

Note that even if $s(t)$ was not periodic, the inversion formula (11.7) will
produce a signal that is periodic with a period $2\pi/\omega_0$. In other words, the
original $s(t)$ is continued periodically outside the analysis interval
$[t_0, t_0 + 2\pi/\omega_0]$. (This simple fact has occasionally been ignored with
rather peculiar consequences; for example, in signal detection theory “ideal
observers” were created that can detect signals in noise at arbitrarily low
signal-to-noise ratios, which is of course hocus-pocus.)

Another discipline in which the Fourier transform has found important
applications is probability theory. The characteristic function of a probability
distribution $p(x)$ is in fact its Fourier transform (except that the minus sign
in the exponent is usually omitted). When two independent random variables
are added, the probability distribution of their sum is given by a convolution
integral (or sum) of the individual distributions. But convolution means simple
multiplication in the Fourier domain (see Sect. 11.4). Thus characteristic
functions are multiplied (and their logarithms are added) when independent
random variables are added, a highly satisfactory state of affairs. Proofs of
the Central Limit Theorem of probability theory (“everything” tends to a
Gaussian distribution) make judicious use of this connection.

One precondition for the validity of the theorem is that the probability
distributions involved have finite variances. But the Fourier transform can
even make a contribution when variances don’t exist, as in the case of the
infamous Cauchy distribution

$$p(x) = \frac{1}{\pi(1 + x^2)} ,$$

whose second moment is infinite. Incredibly, when averaging Cauchy-distributed
variables, estimates of the mean of the distributions do not improve. In
fact, the average of two (or more) identically distributed Cauchy variables
has the same (and not a narrower) Cauchy distribution. This follows from
the fact that the Fourier transform of $p(x)$ equals

$$P(\xi) = \exp(-|\xi|) ,$$

and averaging leads to exactly the same expression.
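
The last step can be written out explicitly. The following short derivation is a sketch that assumes two independent variables $x_1$ and $x_2$, each with the characteristic function $P(\xi)$ given above; the symbol $P_{\mathrm{avg}}$ is introduced here only for illustration.

```latex
\begin{align*}
\text{halving a variable, } x \to x/2: \quad
  & P(\xi) \;\to\; P(\xi/2) = \exp(-|\xi|/2) , \\
\text{adding the two halved variables: } \quad
  & P_{\mathrm{avg}}(\xi) = \left[ P(\xi/2) \right]^2
    = \exp(-|\xi|) = P(\xi) .
\end{align*}
```

For a Gaussian with $P(\xi) = \exp(-\xi^2/2)$, the same two steps give $\exp(-\xi^2/4)$, i.e. a distribution narrower by a factor $\sqrt{2}$, which is the familiar benefit of averaging.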

Probability distributions that preserve their functional form when random 
variables are added play a prominent role in statistics and are called invariant 
or infinitely divisible distributions [11.5]. The best known example is the 
Gaussian distribution: the sum of two Gaussian variables is again Gaussian. 
For independent variables, the variance of the sum is the sum of the individual 
variances. In other words, the squared width of the distribution of the sum 
is the sum of the individual squared widths. 

But the Gaussian distribution is far from alone in the lofty realm of
invariant distributions. The Cauchy distribution, too, is preserved when random
variables are added, but the scaling law is different: the widths themselves
add, not their squares. For other invariant distributions some other power $D$
of the widths, $0 < D < 2$, is added, with some rather counterintuitive
consequences. For example, for $D = 0.5$, averaging two random variables
increases the width by a factor 2 [11.4]. Benoit Mandelbrot, in his recent
Fractals and Scaling in Finance, parades a whole panoply of scaling paradoxes
before the unsuspecting reader [11.6].
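
The scaling rule is easy to try out numerically. The sketch below assumes, as stated above, that the $D$th powers of the widths add when variables are summed; the function name `width_of_average` is ours and purely illustrative.

```python
def width_of_average(n: int, d: float, width: float = 1.0) -> float:
    """Width of the average of n identically distributed variables,
    assuming the d-th powers of the widths add when variables are summed."""
    # Sum of n variables: w_sum**d = n * width**d  ->  w_sum = n**(1/d) * width
    width_of_sum = n ** (1.0 / d) * width
    # Averaging divides the sum, and therefore its width, by n.
    return width_of_sum / n


if __name__ == "__main__":
    for d in (2.0, 1.0, 0.5):  # Gaussian, Cauchy, and the D = 0.5 case
        print(f"D = {d}: width of the average of two = {width_of_average(2, d):.4f}")
```

For $D = 2$ the average is narrower by $1/\sqrt{2}$, for $D = 1$ the width is unchanged, and for $D = 0.5$ it doubles.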



11.3 The Autocorrelation Function 

Autocorrelation is one of the most potent concepts in signal analysis. The
autocorrelation function $a(\tau)$ of a square-integrable signal $s(t)$ is defined by

$$a(\tau) := \int_{-\infty}^{\infty} s(t)\,s^*(t - \tau)\,dt , \qquad (11.8)$$

where $s^*$ is the complex conjugate of $s$. For real signals, $s^* = s$ and the star
superscript can be omitted. As can be seen from (11.8), $a(\tau)$ is shift-invariant,
i.e. it is not affected by a time shift (delay) of the signal $s(t)$.

The autocorrelation function of a signal measures its correlation at two
instants of time separated by a time interval $\tau$. From the definition (11.8) it
follows immediately that $a(-\tau) = a^*(\tau)$. For real signals, the autocorrelation
function is itself real and therefore symmetric: $a(-\tau) = a(\tau)$.

The autocorrelation function $a(\tau)$ of a signal $s(t)$ is related to its Fourier
transform $\tilde{s}(\omega)$ by the famous Wiener-Khinchin relation [11.7]:

$$a(\tau) \longleftrightarrow |\tilde{s}(\omega)|^2 .$$

Here the two-sided arrow means that $a(\tau)$ and the absolute square of the
Fourier transform, $|\tilde{s}(\omega)|^2$, the energy spectrum, are related by a Fourier
transformation and its inverse; in other words, they form a Fourier pair:

$$|\tilde{s}(\omega)|^2 = \int_{-\infty}^{\infty} a(\tau)\exp(-i\omega\tau)\,d\tau \qquad (11.9)$$

and

$$a(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\tilde{s}(\omega)|^2\exp(i\omega\tau)\,d\omega . \qquad (11.10)$$



The energy spectrum of a signal measures its “frequency content” . For orches- 
tral music the frequency content may exceed the range of the healthy human 
ear (20 Hz to 20 000 Hz). Old-fashioned (and some not so old-fashioned) tele- 
phone speech signals are limited to frequencies between about 300 Hz and 
3000 Hz. 

The value of the autocorrelation function for $\tau = 0$, according to (11.8),
equals the total energy of the signal, that is the integral of $|s(t)|^2$ over all
times:

$$a(0) = \int_{-\infty}^{\infty} |s(t)|^2\,dt . \qquad (11.11)$$

On the other hand, according to (11.10):

$$a(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\tilde{s}(\omega)|^2\,d\omega . \qquad (11.12)$$

If we interpret $(1/2\pi)|\tilde{s}(\omega)|^2$ as the energy density at angular frequency
$\omega$, then a comparison of (11.11) and (11.12) tells us that we can measure the
energy of a signal in either the time domain or in the frequency domain and
get the same result. The equality

$$\int_{-\infty}^{\infty} |s(t)|^2\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\tilde{s}(\omega)|^2\,d\omega \qquad (11.13)$$

is a special case of Parseval’s theorem:

$$\int_{-\infty}^{\infty} s_1(t)\,s_2^*(t)\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{s}_1(\omega)\,\tilde{s}_2^*(\omega)\,d\omega , \qquad (11.14)$$

named after Marc-Antoine Parseval des Chênes (1755-1836) who derived it
as early as 1799 (and then had to flee France to escape Napoleon’s wrath over
some of his poems, considered too critical of Bonaparte’s regime). Equation
(11.14) shows that the Fourier transform is a mathematical mapping that
preserves inner products as defined in (11.14) [11.8]. Parseval’s theorem is
a consequence of the completeness of the Fourier transform; it would not
be true if some frequencies present in the signal were not represented in its
Fourier transform.
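
For sampled signals of finite length, the discrete counterpart of (11.14) can be checked directly. The following is a minimal sketch, assuming a naive discrete Fourier transform (DFT) with the same sign convention as (11.2) and a factor $1/N$ in place of $1/2\pi$; the helper names and test sequences are illustrative.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform with exp(-i...) as in (11.2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inner_time(x1, x2):
    """Inner product sum_n x1[n] * conj(x2[n]) in the time domain."""
    return sum(a * complex(b).conjugate() for a, b in zip(x1, x2))


def inner_freq(x1, x2):
    """The same inner product evaluated from the DFTs (note the factor 1/N)."""
    big_x1, big_x2 = dft(x1), dft(x2)
    return sum(a * b.conjugate() for a, b in zip(big_x1, big_x2)) / len(x1)


s1 = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
s2 = [0.5, -1.0, 2.0, 1.0, 1.0, 0.0, -0.5, 2.5]

if __name__ == "__main__":
    print("time domain:     ", inner_time(s1, s2))
    print("frequency domain:", inner_freq(s1, s2))
```

Setting `s2 = s1` reproduces the energy equality (11.13).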

For sampled (time-discrete) signals $s[n]$, we speak of an autocorrelation
sequence, defined by

$$a[k] := \sum_{n=-\infty}^{\infty} s[n]\cdot s^*[n - k] . \qquad (11.15)$$

Among the innumerable applications of autocorrelation analysis we mention 
the detection of periodicities in signals, such as the fundamental frequency 
(“pitch”) in speech signals [11.9]. Autocorrelation analysis is also basic to 
linear prediction. 
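
As an illustration of periodicity detection, the sketch below evaluates (11.15) for a finite, real test signal and picks the lag with the largest autocorrelation inside a search range. The test signal, the search range, and the function names are assumptions made for this example; practical pitch detectors add windowing, normalization, and voicing decisions.

```python
import math


def autocorrelation(s, k):
    """a[k] = sum_n s[n] * s[n - k] for a real sequence of finite length."""
    return sum(s[n] * s[n - k] for n in range(k, len(s)))


def estimate_period(s, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation in [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorrelation(s, k))


# Test signal: fundamental plus second harmonic, period 40 samples.
PERIOD = 40
signal = [math.sin(2 * math.pi * n / PERIOD) + 0.5 * math.sin(4 * math.pi * n / PERIOD)
          for n in range(400)]

if __name__ == "__main__":
    print("estimated period:", estimate_period(signal, 20, 60), "samples")
```

The lower search limit keeps the estimator away from the trivial maximum at zero lag.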

The second derivative of the autocorrelation function at $\tau = 0$ is
proportional to the second moment of the energy spectrum. Thus, an important
spectral parameter can be measured without Fourier analysis. More generally,
the $2m$th derivative is proportional to the $2m$th spectral moment.

According to a famous formula by S. O. Rice [11.10] for a Gaussian
process, the second derivative of the (properly defined) autocorrelation function
is also related to the average rate of zero crossings. Thus, for example, the
zero-crossing rate of such a process, sharply lowpass filtered at 3 kHz, is 3464
zeros per second. In general, the zero-crossing rate of a bandlimited signal does
not exceed twice the upper cut-off frequency. (But B. F. Logan has shown that,
incredibly, there exist signals whose number of zeros in a given finite time
interval can be arbitrarily high, no matter how low the upper cutoff frequency!
However, these are highly pathological signals that blow up exponentially
outside the time interval considered.)
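
The figure of 3464 zeros per second can be reproduced from the second spectral moment. The sketch below assumes Rice's result in the form "zero crossings per second = $2\sqrt{m_2/m_0}$", where $m_0$ and $m_2$ are the zeroth and second moments of the one-sided power spectrum over frequency in Hz; the spectrum is sampled on a 1 Hz grid, and the function name is illustrative.

```python
import math


def zero_crossing_rate(freqs_hz, power):
    """Expected zero crossings per second of a Gaussian process, computed
    from samples of its one-sided power spectrum (Rice's formula)."""
    m0 = sum(power)                                        # total power
    m2 = sum(f * f * p for f, p in zip(freqs_hz, power))   # second moment
    return 2.0 * math.sqrt(m2 / m0)


# Ideal lowpass noise: flat spectrum from 0 to 3 kHz, sampled at bin centers.
CUTOFF_HZ = 3000
freqs = [k + 0.5 for k in range(CUTOFF_HZ)]
flat = [1.0] * CUTOFF_HZ

if __name__ == "__main__":
    print(f"numerical:   {zero_crossing_rate(freqs, flat):.1f} zeros/s")
    print(f"closed form: {2 * CUTOFF_HZ / math.sqrt(3):.1f} zeros/s")
```

For a flat spectrum up to a cut-off $f_c$ the moments give $2f_c/\sqrt{3}$, which stays below the general bound $2f_c$ mentioned above.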



11.4 The Convolution Integral and the Delta Function 



One of the most frequent operations that “ties” two functions, $s_1(t)$ and
$s_2(t)$, together is the convolution integral:

$$b(\tau) := \int_{-\infty}^{\infty} s_1(t)\,s_2(\tau - t)\,dt . \qquad (11.16)$$

Because of its frequent occurrence, the convolution integral is often
abbreviated by a “convolution star”:

$$s_1 \star s_2 := \int_{-\infty}^{\infty} s_1(t)\,s_2(\tau - t)\,dt . \qquad (11.17)$$

Thus, instead of (11.16), we may write

$$b(\tau) = s_1 \star s_2 . \qquad (11.18)$$

Often, one of the functions, say $s_2(t)$, is the impulse response of a
time-invariant passive linear system such as an electrical or mechanical filter.
The impulse response $s_2(t)$ is defined as the response at the output of the
system at time $t$ when a delta function impulse, $\delta(t)$, is applied at its input
at time $t = 0$. The delta function $\delta(t)$, also called Dirac function after the
British physicist Paul Adrien Maurice Dirac (1902-1984), who introduced it to
facilitate computations in quantum mechanics, is defined as a function that
equals 0 everywhere except at $t = 0$, where it is infinite, and whose integral
equals 1:

$$\delta(t) = 0 \quad \text{for } t \neq 0 , \qquad \int_{-\infty}^{\infty} \delta(t)\,dt = 1 . \qquad (11.19)$$



The Dirac function, the mathematical idealization of a short pulse, is a useful
tool in many areas of engineering and physics, including thermodynamics. It
often emerges naturally as the result of a limiting process. For example, it
can be obtained from the normal distribution
$(2\pi\sigma^2)^{-1/2}\exp(-t^2/2\sigma^2)$ by letting its width (standard deviation)
$\sigma$ go to zero while its value at $t = 0$ grows without bound.

If instead of a Dirac pulse $\delta(t)$ another function, $s_1(t)$, is applied to the
input of the system, then linear superposition results in an output at time
$\tau$ given by a convolution integral (11.16). Thus, a linear system is said to
convolve (please don’t say “convolute”) an input function with its impulse
response.

One of the properties of the delta function is that it can “select” the value
of any other function $s(t)$ for any given time, say $t'$:

$$s(t') = \int_{-\infty}^{\infty} s(t)\,\delta(t - t')\,dt . \qquad (11.20)$$

From this property and (11.2) we see that the Fourier transform of $\delta(t)$ is a
constant:

$$\tilde{\delta}(\omega) = 1 . \qquad (11.21)$$

Thus, the Dirac function contains all frequencies between $-\infty$ and $+\infty$ with
equal amplitude and zero phase; it is the ideal(ized) test function in linear
system analysis.

It is interesting to note that the autocorrelation function of white noise,
$n(t)$, is the Dirac delta function. Writing correlation as a time-reversed
convolution, we have:

$$n(t) \star n(-t) = \delta(t) , \qquad (11.22)$$

which, as we just saw, contains all frequencies with equal strength.

Fourier transforming the right side of (11.16) with respect to the variable
$\tau$ yields one of the more interesting results of system analysis:

$$\tilde{b}(\omega) = \tilde{s}_1(\omega)\cdot\tilde{s}_2(\omega) . \qquad (11.23)$$

This means that, for linear systems, the Fourier transform of the output
function equals the product of the Fourier transforms of the input function
and of the impulse response. Fourier transformation has turned the often
difficult operation of integration into a simple product! Or, in a manner of
speaking, it has transformed the convolution star of (11.18) into the simple
multiplication dot of (11.23). This feat is yet another reason for the power
and prevalence of the Fourier transform. Conversely, the Fourier transform
$\tilde{s}(\omega)$ of the product of two signals, $s(t) = s_1(t)\cdot s_2(t)$, is given by a
convolution of their Fourier transforms:

$$\tilde{s}(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{s}_1(\nu)\cdot\tilde{s}_2(\omega - \nu)\,d\nu = \frac{1}{2\pi}\,\tilde{s}_1(\omega) \star \tilde{s}_2(\omega) , \qquad (11.24)$$

a relation that is quickly confirmed by an inverse Fourier transformation
(11.5).

For time-discrete signals the convolution integral is replaced by a
convolution sum:

$$b[n] := \sum_{m=-\infty}^{\infty} s_1[m]\cdot s_2[n - m] . \qquad (11.25)$$
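
For sequences of finite length the convolution sum reduces to a double loop. The following is a minimal sketch, assuming both sequences start at index 0 and vanish elsewhere; the function name `convolve` is illustrative.

```python
def convolve(s1, s2):
    """b[n] = sum_m s1[m] * s2[n - m] for finite sequences starting at n = 0."""
    b = [0.0] * (len(s1) + len(s2) - 1)
    for m, x in enumerate(s1):
        for j, h in enumerate(s2):
            b[m + j] += x * h      # the term with n = m + j
    return b


if __name__ == "__main__":
    # A two-point averager (impulse response [1, 1]) smears each input sample.
    print(convolve([1, 2, 3], [1, 1]))
    # A delayed unit pulse "selects" a shifted copy of the impulse response.
    print(convolve([0, 0, 1], [4, 5]))
```

The second call is the discrete analogue of (11.20): convolving with a unit pulse at $n = 2$ merely delays the other sequence by two samples.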



11.5 The Cross-Correlation Function and the Cross-Spectrum

The cross-correlation function $c(\tau)$ is a generalization of the autocorrelation
function. It measures the linear dependence between two different functions
of time, $s_1(t)$ and $s_2(t)$, at two instants of time, separated by a time interval
$\tau$, for example the size of the harvest in the fall and the rainfall during the
preceding summer. The cross-correlation function is defined by

$$c(\tau) := \int_{-\infty}^{\infty} s_1(t)\cdot s_2(t - \tau)\,dt . \qquad (11.26)$$

For $s_1 = s_2$, the cross-correlation function turns into the autocorrelation
function.

It is interesting to note that the cross-correlation at $\tau = 0$ between a
random ergodic signal $s(t)$ with a symmetrical distribution around $s = 0$
and its square, $s^2(t)$, is zero because the integral over $s \cdot s^2 = s^3(t)$ vanishes.
However, the correlation coefficient between a random variable and its cube
(third power) can be substantial. (The correlation coefficient $c$ between two
random variables, $a$ and $b$, is defined as
$c = \overline{(a - \bar{a})(b - \bar{b})}/\sigma_a\sigma_b$, where the horizontal bars
signify expectations (averages) and $\sigma_a$ and $\sigma_b$ are the standard deviations
of $a$ and $b$, respectively.) For a uniformly distributed random variable, the
correlation coefficient between it and its cube is 92%. For a Gaussian signal it
is 77% and for a sinewave a whopping 95%! (So much for the idea that
correlations near 100% are indicative of strong linear dependence.)
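
These three percentages follow from the even moments of the respective distributions. For a zero-mean symmetric variable $x$, the definition above reduces to $c = \overline{x^4}/\sqrt{\overline{x^2}\cdot\overline{x^6}}$. The sketch below plugs in the standard moments of a uniform variable on $[-1, 1]$, a unit-variance Gaussian, and a unit-amplitude sine wave sampled at uniformly distributed phase; the function and dictionary names are illustrative.

```python
import math


def cube_correlation(m2, m4, m6):
    """Correlation coefficient between x and x**3 for a zero-mean symmetric x
    with moments E[x^2] = m2, E[x^4] = m4, E[x^6] = m6."""
    return m4 / math.sqrt(m2 * m6)


CASES = {
    "uniform on [-1, 1]":        (1 / 3, 1 / 5, 1 / 7),
    "Gaussian, unit variance":   (1.0, 3.0, 15.0),
    "sine wave, unit amplitude": (1 / 2, 3 / 8, 5 / 16),
}

if __name__ == "__main__":
    for name, moments in CASES.items():
        print(f"{name}: {100 * cube_correlation(*moments):.1f}%")
```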

Comparing (11.26) with (11.16), we discover a formal similarity between
cross-correlation and convolution: replacing $s_2(t)$ in (11.16) by $s_2(-t)$ we see
that $b(\tau)$ becomes $c(\tau)$. Thus, for real signals, cross-correlation and
convolution differ by a time inversion in one of the two signals.

Fourier transforming (11.26) yields the so-called cross-spectrum:

$$\tilde{c}(\omega) = \tilde{s}_1(\omega)\cdot\tilde{s}_2^*(\omega) . \qquad (11.27)$$

Measurement of the cross-correlation function is often the most convenient
way to determine the impulse response $h(t)$ of a linear system. A white noise
signal $n(t)$ is applied to the input of the system and its output $a(t)$ (the
convolution of the input with the impulse response) is cross-correlated with the
input white noise. Since cross-correlating and convolving are interchangeable
operations, the result is the correlation of the white noise with itself [a Dirac
delta pulse, see (11.22)] convolved with the impulse response, i.e. the impulse
response itself. Writing correlations as convolutions with an inverted time
direction, we have

$$a(t) = n(t) \star h(t)$$

$$a(t) \star n(-t) = n(t) \star n(-t) \star h(t) = \delta(t) \star h(t) = h(t) .$$

Instead of white noise with a Gaussian amplitude distribution from a thermal
source (e.g. a resistor at room temperature), a preferred noise source these
days is the binary white noise from a shift-register with feedback [11.11].
With proper feedback, the shift-register will produce a periodic sequence of
binary (0 or 1) pulses with a period length of $2^m - 1$, where $m$ is the number
of register stages. A single period will contain all possible combinations of
$m$ 0s or 1s except the all-zero $m$-tuple. The circular autocorrelation function
of such a sequence in the $\pm 1$ alphabet (where 0 is replaced by 1 and 1 by
$-1$) equals $2^m - 1$ for zero shift and a constant $-1$ for all other shifts that
do not equal an integer number of period lengths. The off-values ($-1$) of the
autocorrelation function are therefore as small as possible in absolute terms.
(They cannot equal 0 because the sequence length is odd.) Such sequences
are also called low-autocorrelation sequences, with applications in many fields
(radar, sound diffusors, error correcting codes, system analysis, and antenna
design) [11.12].

The Fourier transform of such a sequence has constant magnitude for
all nonzero frequencies. The power spectrum equals $2^m$ for all frequencies
except zero frequency (“dc”) [11.12]. Such sequences therefore have an
exactly flat spectrum, as opposed to thermal noise, which has only an expected
flat spectrum (approximately flat only with sufficient averaging). Such
sequences are of course deterministic rather than truly random and therefore
also called pseudo-random or reproducible noise. These noises have important
applications in psychoacoustics (hearing research) because auditory tests
often depend on the exact shape of the short-time spectra of signal and noise.
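
The measurement procedure can be imitated in a few lines. The sketch below is an illustration under stated assumptions: a 15-sample $\pm 1$ shift-register sequence (four stages, stages 3 and 4 fed back through an exclusive-or), a made-up impulse response `h_true` shorter than one period, and circular (periodic) convolution and correlation. Because the circular autocorrelation is $N$ at zero shift and $-1$ elsewhere rather than an ideal delta pulse, the raw cross-correlation equals $(N+1)h[k]$ minus the sum of all $h$ values; the last function removes that constant offset.

```python
def mls(m=4, taps=(3, 4)):
    """One period (2**m - 1 samples) of a +/-1 shift-register sequence.
    taps: 1-based stages whose XOR is fed back to the first stage."""
    reg = [1] * m
    out = []
    for _ in range(2 ** m - 1):
        out.append(1 - 2 * reg[-1])        # map 0 -> +1, 1 -> -1
        fb = 0
        for t in taps:
            fb ^= reg[t - 1]
        reg = [fb] + reg[:-1]              # shift by one stage
    return out


def circular_convolve(x, h):
    n = len(x)
    return [sum(h[m] * x[(k - m) % n] for m in range(len(h))) for k in range(n)]


def circular_crosscorrelate(y, x):
    n = len(x)
    return [sum(y[t] * x[(t - k) % n] for t in range(n)) for k in range(n)]


def measure_impulse_response(h_true):
    """Recover h (zero-padded to one period) from input/output cross-correlation."""
    x = mls()
    n = len(x)
    y = circular_convolve(x, h_true)       # the "system output"
    c = circular_crosscorrelate(y, x)      # c[k] = (n + 1) * h[k] - sum(h)
    total = sum(c)                         # equals sum(h)
    return [(ck + total) / (n + 1) for ck in c]


if __name__ == "__main__":
    print(measure_impulse_response([1.0, 0.5, -0.25, 0.125]))
```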



11.5.1 A Bit of Number Theory 

The proper feedback connections are obtained from so-called primitive
polynomials in finite number (Galois) fields $GF(p^m)$. Here $p$ is a prime number
($p = 2$ for binary noise) and $m$ is an integer that determines the period length.
For example, for $m = 50$ and a sampling rate of 10 kHz, the period length
exceeds 3567 years, i.e. the noise is aperiodic for all practical (nonbiblical)
purposes. For system analysis, the period can be much shorter, but it must
be longer than the significant memory of the system. For concert halls with a
reverberation time of 2 s, a sufficient value of $m$ is 16, giving a period length
exceeding 3 s at a sampling rate of 20 kHz.

A primitive polynomial in a Galois field $GF(p^m)$ is defined as an
irreducible (nonfactorizable) polynomial which divides without remainder
$x^n - 1$ with $n = p^m - 1$ but for no smaller value of $n$ [11.13]. For example,
for $p = 2$ and $m = 4$ the polynomial $1 + x + x^4$ is primitive because it cannot
be factored, divides $x^{15} - 1$ but does not divide $x^n - 1$ for
$n < 2^4 - 1 = 15$. By contrast, the irreducible polynomial
$1 + x + x^2 + x^3 + x^4$ divides $x^5 - 1$ and is therefore not primitive. If used
in a shift-register with feedback it would result in a sequence of pulses with
period length 5 rather than 15. As a consequence not all fifteen possible
4-tuples would appear and the power spectrum would not be flat. (And used as a
noise source for synthetic speech - as has been done - the fricatives will sound
awfully discolored.)
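
The divisibility condition can be tested mechanically: a polynomial divides $x^n - 1$ exactly when $x^n$ leaves the remainder 1 on division by it. The sketch below, for $p = 2$ only, stores a polynomial as an integer bitmask (bit $k$ set means the term $x^k$ is present) and finds the smallest such $n$; the encoding and the function name `order_of_x` are choices made for this example.

```python
def order_of_x(poly: int) -> int:
    """Smallest n >= 1 with x**n = 1 modulo poly over GF(2).
    poly is a bitmask with a nonzero constant term, e.g. 0b10011 = 1 + x + x^4."""
    degree = poly.bit_length() - 1
    remainder = 1                      # the polynomial "1"
    n = 0
    while True:
        remainder <<= 1                # multiply by x
        if (remainder >> degree) & 1:  # degree reached: subtract (XOR) poly
            remainder ^= poly
        n += 1
        if remainder == 1:
            return n


if __name__ == "__main__":
    for name, poly in [("1 + x + x^4", 0b10011),
                       ("1 + x^3 + x^4", 0b11001),
                       ("1 + x + x^2 + x^3 + x^4", 0b11111)]:
        print(f"{name}: divides x^n - 1 first for n = {order_of_x(poly)}")
```

A polynomial of degree 4 over $GF(2)$ is primitive when this smallest $n$ equals $2^4 - 1 = 15$.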

The primitive polynomial determines the feedback connection applied to
the shift-register. For $1 + x^3 + x^4$, for example, the outputs of the third and
fourth register are fed to an “exclusive or” (XOR) gate (the logic equivalent
of binary additions without carry). The output of the XOR gate is fed back
to the input of the first register.
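
The feedback rule just described translates directly into code. The following sketch simulates the four-stage register with taps at stages 3 and 4, then checks the properties claimed for such sequences: every nonzero 4-tuple occurs once per period, the circular autocorrelation in the $\pm 1$ alphabet is 15 at zero shift and $-1$ elsewhere, and the power spectrum is $2^4 = 16$ away from dc. The starting state (all ones) and the function names are assumptions of the example.

```python
import cmath


def lfsr_bits(m=4, taps=(3, 4)):
    """One period of the binary sequence from an m-stage feedback shift register."""
    reg = [1] * m                          # any nonzero starting state will do
    bits = []
    for _ in range(2 ** m - 1):
        bits.append(reg[-1])               # output of the last stage
        fb = 0
        for t in taps:                     # XOR of the tapped stages
            fb ^= reg[t - 1]
        reg = [fb] + reg[:-1]              # feed back into the first stage
    return bits


def circular_autocorrelation(x):
    n = len(x)
    return [sum(x[t] * x[(t - k) % n] for t in range(n)) for k in range(n)]


def power_spectrum(x):
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2 for k in range(n)]


bits = lfsr_bits()
pm = [1 - 2 * b for b in bits]             # 0 -> +1, 1 -> -1
windows = {tuple(bits[(i + j) % len(bits)] for j in range(4))
           for i in range(len(bits))}

if __name__ == "__main__":
    print("bits:           ", bits)
    print("autocorrelation:", circular_autocorrelation(pm))
    print("power spectrum: ", [round(p, 6) for p in power_spectrum(pm)])
```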

Number theory guarantees the existence of irreducible and primitive
polynomials for all primes $p$ and integers $m$. In fact, there are

$$\frac{1}{m}\sum_{d|m} \mu\!\left(\frac{m}{d}\right) p^d$$

irreducible polynomials, where the sum is extended over all divisors $d$ of $m$
and $\mu(\cdot)$ is the Möbius function. For $p = 2$ and $m = 4$, there are three
irreducible polynomials.

The number of primitive polynomials is

$$\frac{1}{m}\,\phi(p^m - 1) ,$$

where $\phi(\cdot)$ is Euler’s totient function, which counts the number of integers
relatively prime to and smaller than its argument. With $\phi(15) = 8$, there
are exactly two primitive polynomials in $GF(2^4)$, namely the trinomials
$1 + x^3 + x^4$ and $1 + x + x^4$. For $p = 2$ and $m = 16$, there are 2048 primitive
polynomials, many of which have more than three terms. Maximum-length
sequences generated with polynomials having many terms are advantageous
for testing nonlinear systems (e.g. loudspeakers) because they have better
high-order correlation properties.
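
Both counting formulas are easily evaluated. The sketch below uses trial-division factoring, which is adequate for the small arguments occurring here; all function names are illustrative.

```python
def prime_factors(n):
    """Prime factorization of n as {prime: exponent}, by trial division."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def mobius(n):
    """Moebius function: 0 if n has a squared prime factor, else (-1)**(number of primes)."""
    f = prime_factors(n)
    if any(e > 1 for e in f.values()):
        return 0
    return -1 if len(f) % 2 else 1


def totient(n):
    """Euler's totient function."""
    result = n
    for p in prime_factors(n):
        result -= result // p
    return result


def count_irreducible(p, m):
    """Number of irreducible polynomials of degree m over GF(p)."""
    return sum(mobius(m // d) * p ** d for d in range(1, m + 1) if m % d == 0) // m


def count_primitive(p, m):
    """Number of primitive polynomials of degree m over GF(p)."""
    return totient(p ** m - 1) // m


if __name__ == "__main__":
    print("irreducible, p = 2, m = 4: ", count_irreducible(2, 4))
    print("primitive,   p = 2, m = 4: ", count_primitive(2, 4))
    print("primitive,   p = 2, m = 16:", count_primitive(2, 16))
```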

Interestingly, if the white noise is Gaussian, one of the signals in the
cross-correlation may be distorted nonlinearly and even be infinitely clipped, i.e.
replaced by its algebraic sign ($\pm 1$) - a considerable convenience for some
types of measurements, especially in underwater sound. In another example,
measuring the impulse response of an animal inner ear, it suffices to
cross-correlate the acoustic input noise with the nerve spikes on the acoustic nerve
running from the inner ear to the brain. However, when using non-Gaussian
noises, such as binary maximum-length sequences, for such measurements,
nonlinearities do give rise to false results.

The reason that in cross-correlating Gaussian processes one of the pro- 
cesses may be distorted, for example by clipping, is that, in the frequency 
domain, distortion products at different frequencies are uncorrelated with the 
undistorted process. 

However, nonlinear distortion does change the scale factor of the
correlation. As a result, the ratio

$$0 \le \gamma_{1,2}(\omega) = \frac{|\tilde{c}(\omega)|}{|\tilde{s}_1(\omega)\cdot\tilde{s}_2^*(\omega)|} \le 1 ,$$

called the coherence function, is no longer equal to unity, the value implied by
(11.27). Thus, the cross-spectrum can be used as a test for nonlinearity and
noise generated inside the system.



11.6 The Hilbert Transform and the Analytic Signal 

Real signals have two-sided Fourier transforms and power spectra with
positive and negative frequencies. In fact, for real signals, $s(t) = s^*(t)$,
equation (11.2) tells us that

$$\tilde{s}(-\omega) = \tilde{s}^*(\omega) . \qquad (11.28)$$

Moreover, the energy spectrum is a symmetric function of frequency:

$$|\tilde{s}(-\omega)|^2 = |\tilde{s}(\omega)|^2 . \qquad (11.29)$$

Thus, each positive (physical) frequency is represented twice in the spectrum,
a fact that results in unwieldiness for certain signal operations such as
frequency and phase shifts. There is also an inelegant lack of symmetry: speech
and many other signals are real, but Fourier transforms are in general
complex. Let us therefore replace real signals $s(t)$ by complex signals $\sigma(t)$ whose
real part equals $s(t)$. This idea leads to the eminently useful definition of the
analytic signal

$$\sigma(t) := s(t) + i\,\hat{s}(t) . \qquad (11.30)$$

Here $\hat{s}(t)$ is called the Hilbert transform of $s(t)$, defined as the principal value
of the following integral:

$$\hat{s}(t) := \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{s(t - \tau)}{\tau}\,d\tau . \qquad (11.31)$$

(Principal value means that a small interval $-\epsilon < \tau < \epsilon$ is excluded from the
integration. Then the limit $\epsilon \to 0$ is taken.)

The Hilbert transform converts cosine waves into sine waves and sine
waves into negative cosine waves. The Hilbert transform therefore effects
a 90° phase shift. Two Hilbert transforms in tandem correspond to a 180°
phase shift, that is, a reversal of the algebraic sign of the signal. The definition
(11.31) is in the form of a convolution integral, to wit:

$$\hat{s}(t) = s(t) \star \frac{1}{\pi t} . \qquad (11.32)$$

Fourier transformation thus yields

$$\tilde{\hat{s}}(\omega) = \tilde{s}(\omega)\cdot\widetilde{\left(\frac{1}{\pi t}\right)} , \qquad (11.33)$$



where $\tilde{\hat{s}}(\omega)$ is the Fourier transform of the Hilbert transform $\hat{s}(t)$ and the
second factor on the right is the Fourier transform of the function $1/\pi t$:

$$\int_{-\infty}^{\infty} \frac{1}{\pi t}\exp(-i\omega t)\,dt = -\frac{2i}{\pi}\int_{0}^{\infty} \frac{\sin(\omega t)}{t}\,dt . \qquad (11.34)$$

The integral on the right has a well-known value, namely 0 for $\omega = 0$, $\pi/2$ for
$\omega > 0$ and $-\pi/2$ for $\omega < 0$. That the magnitudes of the integrals appearing
in (11.34) are independent of $\omega$ for $\omega \neq 0$ also follows from a simple scaling
argument. (Replace $t$ by a new variable $u = \omega t$ and nothing is changed.)
Defining the sign-function:

$$\mathrm{sgn}(\omega) := \begin{cases} 1 & \text{for } \omega > 0 \\ 0 & \text{for } \omega = 0 \\ -1 & \text{for } \omega < 0 \end{cases} \qquad (11.35)$$

we can write

$$\tilde{\hat{s}}(\omega) = -i\,\mathrm{sgn}(\omega)\cdot\tilde{s}(\omega) . \qquad (11.36)$$

In the frequency domain, the Hilbert transform therefore corresponds to a
multiplication by $\pm i$, which is equivalent to a 90° phase shift. With (11.36),
the Fourier transform of the analytic signal $\sigma(t)$, equation (11.30), becomes

$$\tilde{\sigma}(\omega) = \begin{cases} 2\tilde{s}(\omega) & \text{for } \omega > 0 \\ \tilde{s}(\omega) & \text{for } \omega = 0 \\ 0 & \text{for } \omega < 0 . \end{cases} \qquad (11.37)$$

The analytic signal therefore contains no negative frequencies, an
important and desirable property, which also has interesting applications in the
time domain, see Sect. 11.8.
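
For sampled, periodic signals, (11.37) suggests a simple recipe for computing the analytic signal: transform, double the positive-frequency components, delete the negative-frequency ones, and transform back. The sketch below does this with a naive discrete Fourier transform; treating the Nyquist bin like dc, the test signal, and the function names are assumptions of the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the forward transform uses exp(-i...) as in (11.2)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def analytic_signal(s):
    """Discrete analogue of (11.37): double positive, remove negative frequencies."""
    n = len(s)
    spectrum = dft(s)
    for k in range(1, n):
        if 2 * k < n:
            spectrum[k] *= 2       # positive frequencies
        elif 2 * k > n:
            spectrum[k] = 0        # negative frequencies
        # k == n/2 (Nyquist bin for even n) is left unchanged, like dc
    return dft(spectrum, inverse=True)


# Test signal: A*cos(...) with an integer number of cycles in the record.
N, A, CYCLES, ALPHA = 64, 1.5, 5, 0.3
s = [A * math.cos(2 * math.pi * CYCLES * n / N + ALPHA) for n in range(N)]
sigma = analytic_signal(s)
envelope = [abs(v) for v in sigma]      # magnitude of the analytic signal
hilbert = [v.imag for v in sigma]       # imaginary part = Hilbert transform

if __name__ == "__main__":
    print("envelope min/max:", min(envelope), max(envelope))
```

For this cosine the imaginary part comes out as the corresponding sine, in line with the 90° phase shift described above, and the magnitude is the constant $A$.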



11.7 Hilbert Envelope and Instantaneous Frequency 

The magnitude of the analytic signal $|\sigma(t)|$ is called the Hilbert envelope.
Writing

$$\sigma(t) = |\sigma(t)|\exp(i\varphi(t)) , \qquad (11.38)$$

$\varphi(t)$ is called the instantaneous phase. Its time derivative $d\varphi/dt = \dot{\varphi}$ is
called instantaneous angular frequency.

For a sine wave $s(t) = A\cos(\omega t + \alpha)$, the analytic signal equals
$\sigma(t) = A\exp(i(\omega t + \alpha))$. The envelope $|\sigma(t)|$ is a constant and equals $A$.
The instantaneous phase, $\omega t + \alpha$, is a linear function of time, and the
instantaneous angular frequency is $\omega$; the angle $\alpha$ is a phase constant.

For most amplitude-modulated signals (for example, AM radio signals),
$|\sigma(t)|$ is the information-bearing signal and varies slowly as a function of time
(slowly compared to the “carrier” frequency $\omega$, which means that
$\left|d|\sigma|/dt\right|/|\sigma| \ll \omega$). For phase or frequency-modulated (FM) signals, $\varphi(t)$
is the information-bearing signal and usually varies slowly with time
($\dot{\varphi} \approx \omega$).

For narrow-band signals $|\sigma(t)|\cos(\omega t + \varphi(t))$ (narrow compared to the
carrier frequency $\omega$), the envelope has an immediate intuitive meaning: it
is the envelope curve one would draw over the tops of the waveform just
touching each cycle of the carrier near its maximum. But the expression
“envelope” for $|\sigma(t)|$ is also justified on more formal grounds: starting with a
single signal

$$s(t) = |\sigma(t)|\cos(\varphi(t)) , \qquad (11.39)$$

one can generate an infinite family of signals by phase shifting:

$$s_\alpha(t) = |\sigma(t)|\cos(\varphi(t) + \alpha) , \quad 0 \le \alpha < 2\pi . \qquad (11.40)$$

The mathematical definition of the envelope of such a family of functions is
the (non-negative) function that is as small as possible for each value of $t$
without intersecting any member of the family. It is not too difficult to see
that the so-defined envelope is indeed equal to the Hilbert envelope $|\sigma(t)|$.

Again for narrow-band signals, the instantaneous angular frequency has
an immediate intuitive meaning: it equals (approximately) $2\pi/\Delta t$, where
$\Delta t$ is the time interval between two successive upward (or downward)
zero-crossings. In a sine wave $\Delta t$ is of course constant and equals $2\pi/\omega$.

Shifting the phase of a signal by a constant angle $\alpha$, as in (11.40), results
in simple multiplication of the analytic signal and its Fourier transform by
a phase factor

$$\sigma_\alpha(t) = \sigma(t)\exp(i\alpha) \qquad (11.41)$$

and

$$\tilde{\sigma}_\alpha(\omega) = \tilde{\sigma}(\omega)\exp(i\alpha) . \qquad (11.42)$$

Frequency shifting is obtained by letting the phase $\alpha$ increase linearly with
time. If the frequency shift is larger than the one-sided (lower side)
bandwidth of the signal, a single-sideband (SSB) signal results. Single-sideband
modulation is a widely used method of transmitting audio and other
signals because it requires only half the transmitter bandwidth compared to
double-sideband amplitude modulation. But SSB transmitters are not per
se compatible with AM-receivers; they require special receivers - unless a
method called compatible SSB is employed at the transmitter [11.14].

There are two basic methods of realizing frequency-shifting or SSB- 
modulation. For large frequency shifts (larger than the bandwidth of the 
signal), one can first perform a simple amplitude modulation using a carrier 
frequency equal to the desired shift and then remove the lower sideband and 
the carrier by a sufficiently selective highpass filter. 

In another method, which works even for small frequency shifts, one first
generates the Hilbert transform $\hat{s}(t) = |\sigma(t)|\sin(\varphi(t))$ of the signal
$s(t) = |\sigma(t)|\cos(\varphi(t))$ and combines these two signals as follows to yield
the shifted signal

$$s_{\Delta\omega}(t) = s(t)\cos(\Delta\omega t) - \hat{s}(t)\sin(\Delta\omega t) = |\sigma(t)|\cos(\varphi(t) + \Delta\omega t) , \qquad (11.43)$$

whose instantaneous frequency is increased by $\Delta\omega$ compared to $s(t)$. For
analog circuits, rather than generating the Hilbert transform of a given signal
(which is impossible anyhow because of the delay implicit in Hilbert
transforming), allpass phase filters are employed that generate, from a given
signal, a pair of signals one of which is the Hilbert transform of the other.
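
Equation (11.43) can be checked on the simplest possible case, a single cosine, whose Hilbert transform is the corresponding sine. The sketch below shifts a 100 Hz tone by 5 Hz; the sampling rate, tone frequency, and function name are assumptions made for the example.

```python
import math


def frequency_shift(s, s_hat, delta_omega, times):
    """Evaluate (11.43) sample by sample: s*cos(dw*t) - s_hat*sin(dw*t)."""
    return [a * math.cos(delta_omega * t) - b * math.sin(delta_omega * t)
            for a, b, t in zip(s, s_hat, times)]


OMEGA = 2 * math.pi * 100.0        # 100 Hz tone
DELTA_OMEGA = 2 * math.pi * 5.0    # 5 Hz upward shift
times = [n / 8000.0 for n in range(800)]           # 0.1 s at 8 kHz

s = [math.cos(OMEGA * t) for t in times]
s_hat = [math.sin(OMEGA * t) for t in times]       # Hilbert transform of the cosine
shifted = frequency_shift(s, s_hat, DELTA_OMEGA, times)
expected = [math.cos((OMEGA + DELTA_OMEGA) * t) for t in times]

if __name__ == "__main__":
    print("largest deviation:", max(abs(a - b) for a, b in zip(shifted, expected)))
```

A negative `delta_omega` shifts the tone downward in the same way.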

Frequency shifting is also used to lower the necessary sampling rate for
narrow-band signals. Instead of a sampling rate exceeding twice the upper
cutoff frequency, the maximally down-shifted signal requires only a sampling
rate exceeding twice its bandwidth $B$. Interestingly, this saving in sampling
rate can also be realized without frequency shifting. The trick is to sample
both the signal and its Hilbert transform at a sampling rate a little above
$B$, for a total rate close to $2B$. Interested readers may wish to discover by
themselves how this works.

Besides single-sideband modulation, frequency shifting has been found
useful for preventing instabilities (“howling”) resulting from acoustic
feedback in public address (microphone-loudspeaker) systems [11.15]. A frequency
shift of about 5 Hz is nearly inaudible for most speech signals and results in
a useful extra stability margin of about 6 dB because the excess acoustic
energy generated around the response peaks of the room is “dumped” into the
nearby troughs of the room’s frequency response. Deep troughs are located
typically 4/T Hz next to a high peak. Here T is the reverberation time, which
is about 1 s in many lecture halls [11.16].

While the envelope $|\sigma(t)|$ of a finite-bandwidth signal $s(t)$ is usually not
itself bandlimited, the squared envelope is. In fact, the Fourier transform of
$\sigma^2(t)$ is given by the convolution of $\tilde{\sigma}(\omega)$ with itself. If the signal $s(t)$
has no frequencies above an upper cut-off angular frequency $\omega_u$, the
corresponding analytic signal $\sigma(t)$ has a Fourier transform $\tilde{\sigma}(\omega)$ with the same
cut-off and the convolution of $\tilde{\sigma}(\omega)$ with itself has an upper cutoff angular
frequency of $2\omega_u$.

An even more forceful statement can be made about the Fourier
transform of the absolute squared envelope $|\sigma(t)|^2$ of a signal with (one-sided)
bandwidth $B$: its Fourier transform is limited to frequencies between $-B$
and $B$, no matter what the upper cut-off frequency $\omega_u$. This follows from
the Wiener-Khinchin theorem which says that taking the absolute square of
a function in one Fourier domain corresponds to taking the autocorrelation
function in the other domain (see (11.9) with the variable $\omega$ replaced by $t$,
and $\tau$ replaced by $\omega$). And the autocorrelation of $\tilde{\sigma}(\omega)$, which is confined to
a frequency range of width $B$, is itself limited to $-B < \omega < B$. Referring to the
Fourier transform $\tilde{s}(\omega)$ of the signal $s(t)$ itself, we may say that the Fourier
transform of the squared envelope of $s(t)$ equals the autocorrelation function
of $2\tilde{s}(\omega)$ for $\omega > 0$. (For $\omega = 0$, the factor 2 must be omitted, see (11.37).)

The fact that the Fourier transform of the squared envelope of a signal
is proportional to the autocorrelation function of the Fourier transform for
positive frequencies of the signal itself has several interesting applications.
For example, if we want to lower the peak factor of a periodic signal
(defined as its range, i.e. maximum minus minimum value, divided by its
root-mean-square value), it would of course help if the envelope were a constant
or at least had as little ac (alternating current) power as possible. Thus, the
squared values of the autocorrelation sequence $a_1, a_2, a_3, \ldots$ of the positive
branch of the Fourier transform of the signal should be as small as possible
for a given signal power ($a_0$). In other words, the Fourier transform should
be a low-autocorrelation sequence which, as mentioned before, is a kind of
sequence that has many applications in signal design for radar and sonar, in
filter and antenna design, and in other fields.

Consider a periodic signal consisting of 31 cosine harmonics of equal, unit
amplitudes. The Fourier transform of the signal is a sequence of 31 terms
of the form $\pm 1, \pm 1, \pm 1, \ldots, \pm 1$. Call the corresponding
autocorrelation sequence $a_k$, $k = 0, 1, 2, \ldots, 30$. Which sign
combination has the lowest value of $A = \sum_{k=1}^{30} a_k^2$? This is a
problem of combinatorial complexity, a nasty class of “intractable”
problems. Obviously, the solution is not all +1’s (or all −1’s), for which
choice $A$ would equal 9455. Nor is a random choice optimal: the expected
value of $A$ would be 465. Better results for this problem, as for other
problems of combinatorial complexity (such as the traveling salesperson
problem), are obtained with genetic algorithms or with simulated annealing,
a numerical algorithm inspired by the thermodynamics of slowly cooled
(“annealed”) metals [11.17].
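The two reference values quoted above are easy to reproduce. The following is only a minimal sketch, assuming that $a_k$ denotes the aperiodic autocorrelation of the sign sequence; the function name `sidelobe_energy` and the Monte Carlo settings are illustrative choices, not taken from the text.

```python
import numpy as np

def sidelobe_energy(signs):
    """Return A = sum over k >= 1 of a_k^2 for the aperiodic autocorrelation a_k."""
    x = np.asarray(signs, dtype=float)
    acf = np.correlate(x, x, mode="full")   # lags -(N-1) ... (N-1)
    sidelobes = acf[len(x):]                # keep lags 1 ... N-1
    return float(np.sum(sidelobes ** 2))

N = 31

# All signs equal: a_k = N - k, so A = 1^2 + 2^2 + ... + 30^2.
A_all_plus = sidelobe_energy(np.ones(N))

# Independent random signs: E[a_k^2] = N - k, so E[A] = sum of (N - k).
A_random_expected = sum(N - k for k in range(1, N))

# Monte Carlo estimate of the same expectation (assumed seed and sample size).
rng = np.random.default_rng(0)
trials = [sidelobe_energy(rng.choice([-1.0, 1.0], size=N)) for _ in range(2000)]
A_random_mean = float(np.mean(trials))

print(A_all_plus, A_random_expected, round(A_random_mean, 1))
```

A search for good sign patterns (by annealing or otherwise) would simply call `sidelobe_energy` as its cost function.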

Good results have also been achieved by adjusting the phase angles of a 
periodic waveform to mimic those of an FM signal (which by definition has 
a low peak factor) [11.18]. 

The best results so far come from number theory, specifically the theory
of finite number fields, also known as Galois fields. As mentioned in Sect.
10.5.1, Galois sequences, also called maximum-length sequences, are periodic
pseudorandom sequences whose squared autocorrelation coefficients, $a_k^2$
for $k > 0$, are as small as possible, namely equal to 1. The resulting
waveform has been used by the author as an excitation signal in vocoders,
giving a noticeably smoother sound compared to excitation by
high-peak-factor (impulsive) waveforms.
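The autocorrelation property can be checked for a short example. This sketch assumes a period of 31 and the primitive polynomial $x^5 + x^2 + 1$ over GF(2); both are illustrative choices, not taken from the text, and `galois_sequence` is a hypothetical helper name.

```python
import numpy as np

def galois_sequence():
    """One period (31 terms) of a +-1 maximum-length sequence.

    Uses the recurrence s[n] = s[n-3] XOR s[n-5], which corresponds to the
    primitive polynomial x^5 + x^2 + 1 (an assumed, illustrative choice).
    """
    bits = [1, 0, 0, 0, 0]                  # any nonzero start state works
    while len(bits) < 31:
        bits.append(bits[-3] ^ bits[-5])
    return 1 - 2 * np.array(bits)           # map bit 0 -> +1, bit 1 -> -1

x = galois_sequence()

# Periodic (circular) autocorrelation for all 31 shifts.
acf = np.array([int(np.dot(x, np.roll(x, k))) for k in range(len(x))])

print(acf[0], set(acf[1:].tolist()))
```

All off-peak coefficients come out as −1, so their squares equal 1, the smallest value possible for a ±1 sequence of odd period.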

In addition to the squared envelope, another important bandlimited
function derivable from the analytic signal
$\sigma(t) = |\sigma(t)|\exp(i\varphi(t))$ is the weighted instantaneous
angular frequency

$$\nu(t) := |\sigma(t)|^2\,\dot\varphi(t) , \tag{11.44}$$

which equals the imaginary part of $\dot\sigma\sigma^*$. The time integral of
$\nu(t)$ equals the frequency integral of $\omega$ weighted with the energy
spectrum:

$$\int_{-\infty}^{\infty} \dot\varphi(t)\,|\sigma(t)|^2\,dt = \frac{1}{2\pi}\int_0^{\infty} \omega\,|\tilde\sigma(\omega)|^2\,d\omega . \tag{11.45}$$

This intriguing relation between two average frequencies, one averaged in the
time domain and the other in the frequency domain, follows directly from
Parseval’s theorem (11.14) by setting $s_1(t) = \dot\sigma(t)$ and
$s_2(t) = \sigma(t)$, and equating imaginary parts. Here we have made use of
the fact that the Fourier transform of $\dot\sigma(t)$ equals
$i\omega\tilde\sigma(\omega)$. Similar relationships exist for higher-order
odd moments of the energy spectrum.
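Relation (11.45) can be verified numerically. The sketch below is one possible check, assuming a Gaussian-windowed chirp as test signal (its parameters are arbitrary) and forming the analytic signal by suppressing negative frequencies with an FFT; the time derivative is approximated by central differences, so the two sides agree only to within the discretization error.

```python
import numpy as np

fs = 8000.0                       # sampling rate in Hz (assumed)
dt = 1.0 / fs
N = 32000
t = (np.arange(N) - N // 2) * dt  # -2 s ... +2 s

# Hypothetical test signal: Gaussian-windowed chirp around 60 Hz.
s = np.exp(-t**2 / (2 * 0.3**2)) * np.cos(2 * np.pi * (60.0 * t + 10.0 * t**2))

# Analytic signal: keep DC, double positive frequencies, drop negative ones.
weights = np.zeros(N)
weights[0] = 1.0
weights[1:N // 2] = 2.0
weights[N // 2] = 1.0
sigma = np.fft.ifft(np.fft.fft(s) * weights)

# Left side of (11.45): time integral of nu(t) = Im(sigma_dot * conj(sigma)).
sigma_dot = (np.roll(sigma, -1) - np.roll(sigma, 1)) / (2 * dt)
lhs = np.sum(np.imag(sigma_dot * np.conj(sigma))) * dt

# Right side: (1 / 2 pi) * integral over omega > 0 of omega * |sigma~(omega)|^2.
omega = 2 * np.pi * np.fft.fftfreq(N, dt)
sigma_ft = dt * np.fft.fft(sigma)            # approximates the Fourier integral
d_omega = 2 * np.pi / (N * dt)
positive = omega > 0
rhs = np.sum(omega[positive] * np.abs(sigma_ft[positive])**2) * d_omega / (2 * np.pi)

print(lhs, rhs)
```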

It is interesting to note that the weighted instantaneous frequency
$\nu(t) = |\sigma(t)|^2\dot\varphi(t)$ has a well-known meaning in astronomy.
If we consider the orbit of a planet around the sun as an analytic signal
$\sigma(t)$ with the sun at the origin of the complex plane (the orbital
plane), then Kepler’s Second Law of Planetary Motion, published in 1609, says
that $\nu(t)$ is a constant. Planetary motion, i.e. the distances of the
planets from the sun, considered as analytic signals, could not be simpler.
While the ancient Greeks and even Nicholas Copernicus (1473-1543) and,
initially, indeed Johannes Kepler (1571-1630) himself held fast to the dogma
of constant speeds of the planets, observation taught him otherwise. The
constancy of $\nu(t)$ was later explained by Isaac Newton (1642-1727) as the
preservation of angular momentum of a (heavenly) body subjected only to
central forces, as is the case for a planet attracted by the sun.
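For Gaussian processes, the squared envelope and the instantaneous
frequency are independent in the sense that, for $T \to \infty$,

$$\frac{1}{T}\int_0^T |\sigma(t)|^2\,\dot\varphi(t)\,dt \;\approx\; \frac{1}{T}\int_0^T |\sigma(t)|^2\,dt \cdot \frac{1}{T}\int_0^T \dot\varphi(t)\,dt . \tag{11.46}$$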



This means that the center of gravity (or first moment) of the spectrum of a
Gaussian noise can be determined by merely averaging its instantaneous
frequency $\dot\varphi(t)$ without regard to its amplitude. (The same is of
course trivially true for a single sine wave.) By contrast, as already
mentioned in relation to autocorrelation functions, the rate of zero
crossings of a Gaussian noise measures the second moment of the power
spectrum.

There is an interesting application of this result to Gaussian processes in
the frequency domain, such as the real (or imaginary) part of the transfer
function of a large room or any other linear system with randomly overlapping
normal modes [11.20]. To determine the reverberation time $T_0$ of the
system, if defined as the center of gravity of its squared impulse response,
it suffices to measure the average rate of phase change with frequency. In
fact,

$$T_0 = \frac{\varphi(\omega_1) - \varphi(\omega_2)}{\omega_2 - \omega_1} . \tag{11.47}$$

For exponential decays, $T_0$ is also the time in which the reverberation
energy decays by a factor $e = 2.718\ldots$. The phase difference
$\varphi(\omega_1) - \varphi(\omega_2)$ must include the appropriate multiple
of $2\pi$ accumulated over the frequency interval $(\omega_1, \omega_2)$
being considered [11.21].

Curiously, the relation (11.47) is also true for a single echo with a delay
$T_e$ rather than full-blown reverberation. (In fact, (11.47) is trivially
true in the case of a single echo.)
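The single-echo case makes a convenient sanity check of (11.47), including the role of the accumulated multiples of $2\pi$. The sketch below assumes a pure delay with transfer function $\exp(-i\omega T_e)$; the delay value and the frequency interval are arbitrary illustrative numbers.

```python
import numpy as np

T_e = 0.05                                            # echo delay in s (assumed)
omega = 2 * np.pi * np.linspace(100.0, 200.0, 2001)   # omega_1 ... omega_2

h = np.exp(-1j * omega * T_e)          # transfer function of a pure delay
wrapped = np.angle(h)                  # phase as measured, confined to (-pi, pi]
phase = np.unwrap(wrapped)             # restore the accumulated multiples of 2 pi

# Right-hand side of (11.47).
T_est = (phase[0] - phase[-1]) / (omega[-1] - omega[0])

# Without unwrapping, the estimate is generally wrong.
T_wrapped = (wrapped[0] - wrapped[-1]) / (omega[-1] - omega[0])

print(T_est, T_wrapped)
```

The frequency grid must be fine enough that the phase changes by less than $\pi$ between neighboring points; otherwise the unwrapping itself becomes ambiguous.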



11.8 Causality and the Kramers-Kronig Relations 

One of the most important applications of the Hilbert transform is the
establishment of the relationship between real and imaginary parts of the
Fourier transform of a causal system. A causal system is defined as a system
whose impulse response $h(t)$ is zero for negative times (no output before an
input has been applied):

$$h(t) = 0 \quad\text{for } t < 0 .$$

This causality condition is reminiscent of the Fourier transform of an
analytic signal, which vanishes for negative frequencies:
$\tilde\sigma(\omega) = 0$ for $\omega < 0$, see (11.37). Since real and
imaginary parts of an analytic signal are related by a Hilbert transform, an
analogous Hilbert transform relationship may be expected between the real
and imaginary parts of the Fourier transform of a
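causal signal. Indeed, for a causal system with impulse response $h(t)$,

$$\mathrm{Im}\{\tilde h(\omega)\} = -\int_{-\infty}^{\infty} \frac{\mathrm{Re}\{\tilde h(\Omega)\}}{\pi(\omega - \Omega)}\,d\Omega \tag{11.48a}$$

and

$$\mathrm{Re}\{\tilde h(\omega)\} = \int_{-\infty}^{\infty} \frac{\mathrm{Im}\{\tilde h(\Omega)\}}{\pi(\omega - \Omega)}\,d\Omega , \tag{11.48b}$$

which, exploiting the symmetry $\tilde h(-\omega) = \tilde h^*(\omega)$, can
also be written

$$\mathrm{Im}\{\tilde h(\omega)\} = \frac{2}{\pi}\int_0^{\infty} \frac{\omega\,\mathrm{Re}\{\tilde h(\Omega)\}}{\Omega^2 - \omega^2}\,d\Omega \tag{11.49a}$$

and

$$\mathrm{Re}\{\tilde h(\omega)\} = -\frac{2}{\pi}\int_0^{\infty} \frac{\Omega\,\mathrm{Im}\{\tilde h(\Omega)\}}{\Omega^2 - \omega^2}\,d\Omega . \tag{11.49b}$$

(All integrals are to be understood as principal values.)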



These equations are called Kramers-Kronig relations after the physicists
Hendrik A. Kramers and Ralph de Laer Kronig who established them in
the 1920s in connection with the dispersion of light waves by atoms and
molecules. Light dispersion was an important piece of experimental evidence
in the budding theory of atomic structure. In fact, Werner Heisenberg
(1901-1976) was inspired by these very concepts to emphasize the crucial
role of observable physical quantities, called observables, in his
formulation of quantum mechanics in 1925. The new theory banished
unobservable quantities, insinuated into atomic physics by such misleading
macroscopic images as electronic orbits around an atomic nucleus (as
depicted in the logos of some nuclear enterprises and the Getty oil company
before they replaced it by a yellow dot - presumably the sun). Instead of
ill-defined electronic orbits, physicists substituted wave amplitudes and
relative phases and other actually measurable quantities to describe
submicroscopic events.
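The Hilbert-transform relation between real and imaginary parts can be illustrated with a discrete-time analogue. In the sketch below (an illustration under stated assumptions, not the continuous integrals themselves) a causal sequence is represented on a circular DFT grid, with the upper half of the indices standing for negative times; the decaying test sequence is arbitrary. The real part of the transform determines the even part of the sequence, causality fixes the odd part as the even part times a sign function, and the odd part yields the imaginary part.

```python
import numpy as np

N = 4096
n = np.arange(N)

# Causal test sequence: nonzero only in the first ("positive-time") half.
h = np.where(n < N // 2, 0.9 ** n, 0.0)
H = np.fft.fft(h)

# Even part of h(t) from the real part of the transform alone.
h_even = np.fft.ifft(H.real).real

# For a causal sequence, odd part = even part * sgn(t).
sgn = np.zeros(N)
sgn[1:N // 2] = 1.0        # positive times
sgn[N // 2 + 1:] = -1.0    # negative times (circular indexing)
h_odd = h_even * sgn

# The transform of the odd part is i * Im{H}.
imag_from_real = np.fft.fft(h_odd).imag

print(np.max(np.abs(imag_from_real - H.imag)))
```

Multiplication by the sign function in the time domain is what appears as the convolution with $1/\pi(\omega - \Omega)$ in the frequency domain.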



11.8.1 Anticausal Functions 

Occasionally the need arises to consider anticausal functions, defined by
$g(t) = 0$ for $t > 0$. The simplest example of an anticausal system is a
negative delay whose impulse response is given by

$$g(t) = \delta(t - t_0) , \quad t_0 < 0 . \tag{11.50}$$

The Fourier transform is
$\tilde g(\omega) = \exp(-i\omega t_0) = \cos(\omega t_0) + i\sin(-\omega t_0)$,
in which the imaginary part is the Hilbert transform of the real part rather
than vice versa.

Anticausal functions occur in the physics of elementary particles and in
tape-recorded audio signals in which the tape is run backward in time. If
time-reversed speech is heavily reverberated and time-reversed once more
(so that the speech signal is running forward again) it becomes completely
unintelligible. Human auditory perception suppresses normal reverberation
by the acoustic precedence effect. According to the precedence effect (also
known as, but actually distinct from, the “Haas effect”), the direction of
the first-arriving sound determines the perceived direction, even in the
presence of strong echoes. The precedence effect has probably evolved in our
animal ancestors to signal the proper direction of a predator or prey in the
presence of strong reverberation from trees or inside caves. On the other
hand, and not surprisingly, the human ear is unable to cope with
time-reversed reverberation, an anticausal process that does not occur in
airborne sound.^

11.8.2 Minimum-Phase Systems and Complex Frequencies 

Another important application of the Hilbert transform is the relation
between the amplitude response $|\tilde h(\omega)|$ and the phase response
$\varphi(\omega)$ of minimum-phase systems. A minimum-phase system is defined
as a causal linear passive system with transfer function $\tilde h(\omega)$
whose inverse transfer function $1/\tilde h(\omega)$ is also causal.

Minimum-phase systems are best discussed in the complex frequency
plane in terms of the complex frequency variable

$$s := \rho + i\omega . \tag{11.51}$$

Causal systems that are square-integrable are analytic in the upper halfplane
of the complex frequency plane. Thus, neither $\tilde h(\omega)$ nor
$1/\tilde h(\omega)$ can have poles in the upper halfplane. Since the poles
of $1/\tilde h(\omega)$ are the zeros of $\tilde h(\omega)$, we can
characterize a minimum-phase system by a transfer function $\tilde h(\omega)$
that has neither poles nor zeros in the upper half of the complex frequency
plane.

For such a transfer function $\tilde h(\omega)$, its logarithm
$\log[\tilde h(\omega)]$ is also an analytic function in the upper halfplane.
(Note that for this to be true, it is essential that $\tilde h(\omega)$ have
neither poles nor zeros in the upper halfplane because the logarithm diverges
for both poles and zeros.) From this property of $\log[\tilde h(\omega)]$
it follows that its imaginary part is the Hilbert transform of its real part.
With

$$\tilde h(\omega) = |\tilde h(\omega)|\exp(i\varphi(\omega)) \tag{11.52}$$

or

$$\log[\tilde h(\omega)] = \log[|\tilde h(\omega)|] + i\varphi(\omega) ,$$

we therefore have

$$\varphi(\omega) = -\frac{2\omega}{\pi}\int_0^{\infty} \frac{\log|\tilde h(\Omega)|}{\omega^2 - \Omega^2}\,d\Omega . \tag{11.53}$$

Conversely, the logarithmic amplitude response $\log[|\tilde h(\omega)|]$ is
the Hilbert transform of the phase response $\varphi(\omega)$. Thus, for
minimum-phase systems, amplitude and phase response are closely bound to
each other. For every amplitude response there is a unique minimum-phase
response and vice versa.

^However, time-reversed reverberation does occur in the deep ocean over very
long distances. In this so-called SOFAR underwater sound channel, say between
South Africa and Greenland, the strongest sound, which travels in a straight
line, arrives last because the center of the channel has the smallest sound
velocity.

A simple example of a minimum-phase system is a capacitor, charged up
at time $t = 0$, whose charge is decaying exponentially for $t > 0$ with time
constant $\tau$. Its impulse response is

$$h(t) = \begin{cases} 0 & \text{for } t < 0 \\ \exp(-t/\tau) & \text{for } t > 0 \end{cases} \tag{11.54}$$

with Fourier transform

$$\tilde h(\omega) = \frac{\tau}{1 + i\omega\tau} . \tag{11.55}$$

Its phase response

$$\varphi(\omega) = -\arctan(\omega\tau) \tag{11.56}$$

is indeed the negative Hilbert transform of

$$\log[|\tilde h(\omega)|] = \frac{1}{2}\log\left(\frac{\tau^2}{1 + \omega^2\tau^2}\right) . \tag{11.57}$$

A broad class of minimum-phase systems are so-called all-pole systems, which
are defined as systems having only poles (in the lower half of the frequency
plane to be stable) and no zeros. Such systems, as noted above, have causal
inverses $1/\tilde h(\omega)$.
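That the phase of a minimum-phase system is fixed by its amplitude response can be demonstrated numerically. The following sketch uses a discrete-time analogue rather than the capacitor itself: a one-pole system $h[n] = a^n$ (with $a = 0.5$, an arbitrary choice) and the cepstral "folding" method, which is one standard way of carrying out the Hilbert transform of the log-magnitude on a DFT grid. The phase is recovered from $\log|\tilde h|$ alone and compared with the true phase.

```python
import numpy as np

N = 1024
a = 0.5                                   # pole position (assumed), |a| < 1
h = a ** np.arange(N)                     # causal, minimum-phase impulse response
H = np.fft.fft(h)

# Only the magnitude is used from here on.
log_mag = np.log(np.abs(H))

# Real cepstrum: inverse transform of the log-magnitude (an even sequence).
cepstrum = np.fft.ifft(log_mag).real

# Fold the even cepstrum onto positive "times"; this makes log H causal,
# which is equivalent to taking the Hilbert transform in the frequency domain.
fold = np.zeros(N)
fold[0] = 1.0
fold[1:N // 2] = 2.0
fold[N // 2] = 1.0
log_H_min = np.fft.fft(cepstrum * fold)

phase_from_magnitude = log_H_min.imag
phase_true = np.angle(H)

print(np.max(np.abs(phase_from_magnitude - phase_true)))
```

For a system that is not minimum phase, the same procedure returns the phase of its minimum-phase counterpart, not the true phase.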



11.8.3 Allpass Systems 

Another important class of linear passive systems are allpass systems,
defined as systems whose transfer function $\tilde h(\omega)$ has a constant
magnitude $|\tilde h(\omega)|$ equal to 1, that is

$$\tilde h(\omega) = \exp(i\varphi(\omega)) \tag{11.58}$$

or

$$\tilde h(\omega)\,\tilde h^*(\omega) = 1 .$$

For an allpass system, we therefore have

$$\tilde h^{-1}(\omega) = \tilde h^*(\omega) . \tag{11.59}$$

Allpass systems, too, are best discussed in terms of the complex frequency
variable $s = \rho + i\omega$. Thus, the frequency ($\omega$) axis becomes
the imaginary axis in the complex $s$ plane and we write $h[s]$ instead of
$\tilde h(\omega)$, using brackets to indicate the change of variable.
Introducing the poles $p_k$ and zeros $z_k$ of $h[s]$, we write

$$h[s] = \prod_k \frac{s - z_k}{s - p_k} . \tag{11.60}$$

Applying (11.59), we obtain for $s = i\omega$ (i.e. on the frequency axis,
where $s^* = -s$):

$$h^{-1}[s] = \prod_k \frac{s + z_k^*}{s + p_k^*} \tag{11.61}$$

or

$$h[s] = \prod_k \frac{s + p_k^*}{s + z_k^*} . \tag{11.62}$$

Comparing (11.62) with (11.60), we see that for an allpass system the zeros
are given by the poles (and vice versa), namely:

$$z_k = -p_k^* . \tag{11.63}$$

Writing $p_k = \rho_k + i\omega_k$, we have

$$z_k = -\rho_k + i\omega_k , \tag{11.64}$$

that is, the zeros are the mirror images of poles, mirrored in the
$\omega$-axis. Thus each pole (“resonance”) is compensated in its amplitude
effect by a mirror-image zero, but the phase shift engendered by each pole is
doubled by its mirror-image zero.

With (11.60) and (11.63) we may write the transfer function of an allpass
system as

$$h[s] = \prod_k \frac{s + p_k^*}{s - p_k} . \tag{11.65}$$

The impulse response of the inverse of a causal allpass system is anticausal.
That is, the inverse of an allpass system has an impulse response that is the
original impulse response $h(t)$ mirrored in time. This follows from the fact
that $\tilde h^{-1}(\omega) = \tilde h^*(\omega)$, see (11.59), and that
$\tilde h^*(\omega)$ is the Fourier transform of $h(-t)$. Thus, a causal
allpass system has no causal inverse; in fact, the inverse is anticausal.
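The product form (11.65) is easily checked on the frequency axis. The pole positions in this sketch are hypothetical values chosen only for illustration (a conjugate pair and one real pole, all with negative real part).

```python
import numpy as np

# Hypothetical poles p_k = rho_k + i*omega_k in the complex s plane.
poles = np.array([-0.5 + 2.0j, -0.5 - 2.0j, -1.0 + 0.0j])

omega = np.linspace(-10.0, 10.0, 2001)
s = 1j * omega                                   # the frequency axis

# Allpass transfer function (11.65): zeros at -conj(p_k).
factors = (s[:, None] + np.conj(poles)) / (s[:, None] - poles)
h = np.prod(factors, axis=1)

# Inverse transfer function versus complex conjugate, see (11.59).
inverse_minus_conj = 1.0 / h - np.conj(h)

print(np.max(np.abs(np.abs(h) - 1.0)), np.max(np.abs(inverse_minus_conj)))
```

Each factor has unit magnitude because numerator and denominator are equally far from the point $s = i\omega$, which is the mirror-image property expressed by (11.64).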



11.8.4 Dereverberation 

Anticausality can be circumvented by the method of tape recording the signal 
(or putting it into computer storage) and running the tape backward or 
reading out the memory in reverse. (Of course, these operations take time so 
no causality laws are actually violated.) 

To realize the inverse $\tilde h^{-1}(\omega)$ of a general linear passive
system $\tilde h(\omega)$, it must first be factored into a minimum-phase
system $\tilde m(\omega)$ and an allpass system $\tilde a(\omega)$:

$$\tilde h(\omega) = \tilde m(\omega)\,\tilde a(\omega) , \tag{11.66}$$

whose inverse is

$$\tilde h^{-1}(\omega) = \tilde m^{-1}(\omega)\,\tilde a^{-1}(\omega) . \tag{11.67}$$

Here, $\tilde m^{-1}(\omega)$ exists and $\tilde a^{-1}(\omega)$ has to be
realized (or approximated) in nonreal time by a technique that flips the time
axis. In this manner, even the exceedingly complex response of a large
lecture hall or concert hall can in principle be inverted with a delay of
about 2 s. Such an inverse filter, used on the acoustic output of a
reverberant enclosure, can eliminate the reverberation and restore the
original unreverberated sound. This process is an example of
dereverberation. Dereverberation improves speech quality and
intelligibility, for example in “hands-free” speakerphones. Such
applications require real-time dereverberation, which can be realized (apt
word!) by self-steering microphone arrays [11.22] and volume focussing
arrays that “focus” on a given sound source in three-dimensional space
[11.23].



11.9 Matched Filtering 

Inverting the time axis of the impulse response $h(t)$ of an arbitrary system
or filter results in the so-called matched filter whose Fourier transform is
given by $\tilde h^*(\omega)$. (Again, an extra delay is required to make the
matched filter realizable.) Matched filters are often used in signal
detection to extract useful signals from a contaminating noise. If the
Fourier transform of the signal is $\tilde h(\omega)$, then the output of the
matched filter is

$$\tilde h^*(\omega)\cdot\tilde h(\omega) = |\tilde h(\omega)|^2 , \tag{11.68}$$

which is the energy spectrum of the signal. The corresponding inverse Fourier
transform is the autocorrelation of the signal. In a manner of speaking,
matched filtering turns back to zero all the phase angles $\varphi(\omega)$
in the signal’s Fourier transform
$\tilde h(\omega) = |\tilde h(\omega)|\exp(i\varphi(\omega))$ to result in a
large response peak at $t = 0$. At the same time, matched filtering squares
the magnitude $|\tilde h(\omega)|$, see (11.68), thereby emphasizing those
frequency components that are relatively strong compared to the noise
spectrum (which is considered flat in the simplest case).
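In discrete time, the statement that the matched-filter output is the autocorrelation of the signal takes only a few lines to verify. The sketch below uses the 13-term Barker sequence as an illustrative real-valued signal (an assumed example, not from the text) and includes the extra delay that makes the time-reversed filter realizable.

```python
import numpy as np

# Illustrative signal: the 13-term Barker sequence.
signal = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)

# Matched filter: time-reversed impulse response (conjugation is a no-op
# for a real signal), delayed by len(signal) - 1 samples to be causal.
matched = signal[::-1]

output = np.convolve(signal, matched)
autocorrelation = np.correlate(signal, signal, mode="full")

peak_index = int(np.argmax(output))
energy = float(np.sum(signal ** 2))

print(peak_index, output[peak_index], energy)
```

The peak appears at the delay `len(signal) - 1`, which corresponds to $t = 0$ of the autocorrelation, and its height equals the signal energy.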

The concept of matched filtering was first formalized by Dwight O. North
in a 1943 classified report. The name “matched filter” was coined by D.
Middleton and J.H. Van Vleck a year later in another classified report.

Two-dimensional matched filtering has found wide application in optical
recognition systems. For optical applications, the matched filtering is often
carried out by two-dimensional signal processing using coherent (laser)
light, profiting from the fact that a focussing lens performs a good
approximation to a Fourier transform between its two foci [11.24]. For the
properly recognized object, the result of the matched filtering is an easily
detected bright spot at the origin of the correlation plane, while no such
spot occurs for the incorrect objects.

Matched filtering applied to the outputs of an array of microphones,
hydrophones, or other sensors in a multipath medium (such as the oceans)
causes a three-dimensionally bounded space to be selected, resulting in
“volume-focussing” rather than mere beamforming.

Matched filtering is closely related to the so-called factorization problem:
given a square-integrable nonnegative function $\tilde a(\omega)$, find a
causal function of time $h(t)$ such that $\tilde a(\omega)$ is the Fourier
transform of its autocorrelation function

$$h(t) \star h^*(-t) = a(t) \tag{11.69}$$

or

$$\tilde h(\omega)\,\tilde h^*(\omega) = \tilde a(\omega) . \tag{11.70}$$

To make the solution unique, we require that $h(t)$ be the impulse response
of a minimum-phase system.

The factorization problem as expressed in (11.69) has a solution if and
only if $\tilde a(\omega)$ satisfies the Paley-Wiener condition [11.25]:

$$\int_{-\infty}^{\infty} \frac{|\log \tilde a(\omega)|}{1 + \omega^2}\,d\omega < \infty . \tag{11.71}$$

Thus, $\tilde a(\omega)$ cannot vanish on any interval or set with nonzero
measure.



11.10 Phase and Group Delay 

Consider a signal $s(t)$ consisting of two cosine waves with different
angular frequencies:

$$s(t) = \cos(\omega_1 t) + \cos(\omega_2 t) , \quad \omega_2 > \omega_1 . \tag{11.72}$$

How are envelope and phase of $s(t)$ affected if the signal travels through a
filter or medium with a frequency-dependent phase $\varphi(\omega)$?

Using a Taylor expansion around $\omega = \omega_1$ and truncating after the
linear term, we have

$$\varphi(\omega) = \varphi(\omega_1) - (\omega - \omega_1)\,\varphi'(\omega_1) , \tag{11.73}$$

where $\varphi'(\omega_1)$ is the negative derivative of the phase with
respect to angular frequency at $\omega = \omega_1$.

The squared envelope of $s(t)$

$$|\sigma(t)|^2 = 2 + 2\cos((\omega_2 - \omega_1)t) \tag{11.74}$$

fluctuates or “beats” with a frequency $\omega_2 - \omega_1$, equal to the
difference of the two frequencies contained in the signal. If $s(t)$ is
subjected to a phase distortion (11.73), the squared envelope becomes

$$|\sigma_\varphi(t)|^2 = 2 + 2\cos((\omega_2 - \omega_1)(t - \varphi')) . \tag{11.75}$$

For “normal” phase distortion, also called normal dispersion by physicists,
the phase decreases with increasing frequency. The derivative $\varphi'$, as
defined in (11.73), is therefore positive. Thus, (11.75) tells us that the
envelope is delayed by an amount

$$\Delta t_g = \varphi' . \tag{11.76}$$



This delay is called envelope delay or group delay (because it affects a “group” 
of frequencies, not a single frequency). Group delay, which generally depends 
on frequency, is an important concept, especially in two-way communication 
links. For example, the smooth flow of a telephone conversation is severely 
degraded if the round-trip group delay exceeds half a second, as it does in 
connections that use geostationary satellites. Such delays are particularly
bothersome in the presence of delayed echoes of one’s own voice from a distant 
location. 
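The two-tone example lends itself to a direct numerical check of the envelope delay. In the sketch below the two frequencies, the phase at $\omega_1$, and the value of $\varphi'$ are arbitrary illustrative numbers; the phase law is the linear one of (11.73).

```python
import numpy as np

w1 = 2 * np.pi * 1000.0          # omega_1 (assumed)
w2 = 2 * np.pi * 1100.0          # omega_2 (assumed)
phi_1 = -0.3                     # phase at omega_1 in rad (assumed)
phi_prime = 2e-3                 # negative phase slope, i.e. group delay in s

def phase(w):
    """Linear phase law (11.73)."""
    return phi_1 - (w - w1) * phi_prime

t = np.linspace(0.0, 0.05, 5001)

# Analytic signal of the two tones after the phase distortion.
sigma = np.exp(1j * (w1 * t + phase(w1))) + np.exp(1j * (w2 * t + phase(w2)))
squared_envelope = np.abs(sigma) ** 2

# Prediction (11.75): the undistorted beat pattern delayed by phi_prime.
predicted = 2 + 2 * np.cos((w2 - w1) * (t - phi_prime))

print(np.max(np.abs(squared_envelope - predicted)))
```

Note that the constant term `phi_1` drops out of the envelope altogether; it affects only the carrier phase.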

Distance traveled divided by group delay is called group velocity.
According to special relativity, the velocity of light in free space, about
300 000 km/s, is the upper limit for the group velocity of any physically
realizable system.^ Thus, energy or information (which, for AM signals,
resides in the envelope variations) cannot travel faster than light. (Yet in
quantum mechanics there are situations, the Einstein-Podolsky-Rosen paradox
being the most celebrated one, in which instantaneous actions at a distance
do take place: Einstein’s “spooky actions at a distance”. But they cannot be
exploited for transmitting energy, let alone transporting people
(“teleportation”) at the speed of light [11.26].)

The phase delay $\Delta t_\varphi$, introduced by a phase-distorting medium,
is defined as

$$\Delta t_\varphi := -\frac{\varphi(\omega_1)}{\omega_1} . \tag{11.77}$$

The phase delay determines the phase shift in the instantaneous (also called
“carrier”) phase of the signal. Thus, a signal with envelope $|\sigma(t)|$
and instantaneous phase $\varphi(t)$:

$$s(t) = |\sigma(t)|\cos(\varphi(t)) \tag{11.78}$$

emerges from the phase-distorting medium as

$$s_\varphi(t) = |\sigma(t - \Delta t_g)|\cos(\varphi(t) - \omega_1\Delta t_\varphi) . \tag{11.79}$$

Equation (11.79) implies that the envelope of the signal emerges from the
phase-distorting medium unscathed except for a delay. That is strictly true
only for frequency intervals for which the truncated Taylor expansion (11.73)
is a perfect approximation. If the second derivative $\varphi''(\omega)$ is
too large in magnitude, the shape of the envelope will be distorted; for
example, a sharp pulse will become broader because the different frequencies
it contains will suffer different group delays. The condition for envelope
shape preservation is

$$\left|\frac{\varphi''(\omega_1)}{\varphi'(\omega_1)}\right| \ll \frac{1}{\omega_2 - \omega_1} . \tag{11.80}$$

In other words, the relative change in the group delay,
$|\varphi''|(\omega_2 - \omega_1)/\varphi'$, in the frequency interval
$\omega_2 - \omega_1$ must be small. This is an important consideration
in the design of optical fiber links.

^In 1983 the General Conference on Weights and Measures redefined the meter
to make the speed of light c come out an exact integer (containing a
surprisingly large prime factor) in meters per second: c = 299792458 m/s.

Distance traveled divided by the phase delay is called phase velocity.
Phase velocities can exceed the speed of light because they engender no
“extraluminous” causation. A simple example of an arbitrarily high phase
velocity is the bright spot projected on a high cloud cover by an earthbound
search light. When the search light is whipped to a different direction, that
bright spot can travel along the cloud cover faster than light. Another
instance of a phase velocity exceeding the speed of light $c$ is the phase
velocity $v_\varphi$ in a metallic wave guide. In fact, for electromagnetic
wave guides, group and phase velocity are reciprocally related:
$v_g \cdot v_\varphi = c^2$. Since $v_g$ cannot exceed $c$ ($v_g \le c$), it
follows immediately that $v_\varphi \ge c$.
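The reciprocal relation can be illustrated numerically. This sketch assumes the usual dispersion relation of a hollow metallic wave guide, $\omega^2 = \omega_c^2 + c^2k^2$, which is not stated in the text; the cut-off frequency and the frequency range are hypothetical values. The group velocity is obtained by numerical differentiation, so that the product is not built in by construction.

```python
import numpy as np

c = 299_792_458.0                       # speed of light in m/s
omega_c = 2 * np.pi * 6.5e9             # cut-off angular frequency (assumed)

# Angular frequencies above cut-off and the corresponding wave numbers,
# from the assumed dispersion relation omega^2 = omega_c^2 + (c k)^2.
omega = 2 * np.pi * np.linspace(7e9, 12e9, 2001)
k = np.sqrt(omega**2 - omega_c**2) / c

v_phase = omega / k                     # phase velocity
v_group = np.gradient(omega, k)         # group velocity d(omega)/dk, numerically

product = v_phase * v_group / c**2      # should be close to 1

print(product.min(), product.max(), v_phase.min() / c)
```

Close to cut-off the phase velocity grows without bound while the group velocity tends to zero.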

What is true of guided waves is also true for freely propagating waves: if
a plane wave traveling at speed $c$ hits a plane surface at an angle of
incidence $\alpha$ from the plane’s normal, then wave crests travel along the
surface with a speed equal to $c/\sin\alpha$ which, except for grazing
incidence ($\alpha = \pi/2$), exceeds $c$.



11.11 Heisenberg Uncertainty 
and the Fourier Transform 

The Fourier transform has an interesting scaling property. As is evident
from its definition (11.2), a compression of the time axis by a factor $k$
causes a dilation of the frequency axis by the reciprocal factor, $1/k$.
Indeed, if

$$\tilde s(\omega) = \int_{-\infty}^{\infty} s(t)\exp(-i\omega t)\,dt \tag{11.81}$$

then, for $s_k(t) := s(kt)$, the Fourier transform is

$$\tilde s_k(\omega) = \frac{1}{k}\,\tilde s\!\left(\frac{\omega}{k}\right) . \tag{11.82}$$

This reciprocal scaling relationship between time and frequency is the reason
why time-bandwidth products are often a more fundamental quantity than
duration or bandwidth of a signal by itself (for example, in signal detection
tasks detection probabilities do not depend on scaling).

Another implication of reciprocal scaling is that the smaller the time
uncertainty the larger the frequency uncertainty will be (and vice versa).
For example, a rectangularly gated sine wave of duration $T$ has an angular
bandwidth between the first two spectral zeros around the spectral peak
equal to $4\pi/T$. In other words, the time-bandwidth product equals $4\pi$.
Are there pulse shapes which have smaller time-bandwidth products? And what
are these shapes, called time windows by electrical engineers? The answer
depends on the definition of duration and bandwidth. If we use the standard
deviation $\sigma$ (familiar from statistical analysis) as a measure of
width, then we may ask for the pulse shape for which the product
$\sigma_t \cdot \sigma_\omega$ is a minimum (provided $\sigma_t$ and
$\sigma_\omega$ exist). For a pulse in the shape of a Gaussian distribution

$$s(t) = \exp(-t^2/2\sigma^2) , \tag{11.83}$$

the Fourier transform is also Gaussian:

$$\tilde s(\omega) = \sqrt{2\pi}\,\sigma\exp(-\omega^2\sigma^2/2) . \tag{11.84}$$

By squaring (11.83) we obtain the power of the Gaussian pulse whose
standard deviation $\sigma_t$ is seen to equal $\sigma/\sqrt{2}$. Similarly,
by squaring (11.84), we obtain the power in the frequency domain whose
standard deviation $\sigma_\omega$ equals $1/(\sqrt{2}\,\sigma)$. Thus, the
time-bandwidth product of the power distributions in the time and angular
frequency domains, as measured by their standard deviations, equals
$\sigma_t \cdot \sigma_\omega = 1/2$.^
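The value 1/2 for the Gaussian pulse can be confirmed numerically. The sketch below samples a Gaussian of arbitrary width on a hypothetical grid and computes the standard deviations of the power distributions in both domains; the helper `std_dev` is an illustrative name.

```python
import numpy as np

sigma = 0.7                                   # pulse width parameter (assumed)
N = 2**14
t = np.linspace(-20.0, 20.0, N, endpoint=False)
dt = t[1] - t[0]

s = np.exp(-t**2 / (2 * sigma**2))            # Gaussian pulse (11.83)
S = np.fft.fft(s)                             # only |S| is needed below
omega = 2 * np.pi * np.fft.fftfreq(N, dt)     # angular frequency grid

def std_dev(x, weight):
    """Standard deviation of x under the (unnormalized) distribution weight."""
    p = weight / np.sum(weight)
    mean = np.sum(x * p)
    return np.sqrt(np.sum((x - mean) ** 2 * p))

sigma_t = std_dev(t, s**2)                    # width of the power in time
sigma_w = std_dev(omega, np.abs(S)**2)        # width of the power in frequency

print(sigma_t, sigma_w, sigma_t * sigma_w)
```

Replacing the Gaussian by another pulse shape with finite $\sigma_t$ and $\sigma_\omega$, for example a triangular window, gives a product larger than 1/2.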

In 1927 the German physicist Werner Heisenberg showed that the Gaussian
shape has the smallest uncertainty, that is

$$\sigma_t \cdot \sigma_\omega \ge 0.5 . \tag{11.85}$$

This is the justly famous uncertainty relation, except that physicists
associate with the frequency $\omega$ an energy $E = \hbar\omega$, where
$2\pi\hbar = 6.6260755 \times 10^{-34}$ watt · second² is called Planck’s
constant. (In quantum mechanics, $\hbar\omega$ is the energy of a photon that
“accompanies” an electromagnetic wave of angular frequency $\omega$. Its
angular momentum, called spin, equals $\pm\hbar$.) More generally, physicists
discovered in the mid-1920s that, in quantum mechanics, certain conjugate
variables, such as position $x$ and linear momentum $p$, are (except for a
factor $\hbar$) conjugate Fourier variables, just like time and frequency for
speech signals. Consequently, the uncertainty relation (11.85) applies to any
such pair of variables. For example, for $x$ and $p$ we have

$$\sigma_x \cdot \sigma_p \ge \frac{\hbar}{2} . \tag{11.86}$$

Because linear momentum $p$ equals mass $m$ times velocity $v$, equation
(11.86) also implies a velocity uncertainty $\sigma_v = \sigma_p/m$. For
$\sigma_x = 10^{-10}$ m (the diameter of the hydrogen atom), the velocity
uncertainty of an electron (rest mass $m \approx 9.1 \times 10^{-31}$ kg)
equals roughly $10^6$ m/s. (No wonder we can no longer speak meaningfully of
orbits on atomic size scales.)

^In quantum mechanics the squaring of the amplitude distributions (11.83) and
(11.84) corresponds to going from the Schrödinger wavefunction $\psi$ to its
absolute square, which (ever since Max Born’s proposal) is interpreted as a
probability distribution. It is for these probability distributions that the
Heisenberg uncertainties are defined.
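As a rough check of the order of magnitude of the electron's velocity uncertainty, one may insert numbers into the position-momentum uncertainty relation, taking the equality sign as the smallest possible uncertainty. The value $\hbar \approx 1.05 \times 10^{-34}$ J s used here is the standard one, obtained by dividing Planck's constant by $2\pi$:

```latex
\sigma_v \;=\; \frac{\sigma_p}{m} \;\ge\; \frac{\hbar}{2\,m\,\sigma_x}
\;=\; \frac{1.05 \times 10^{-34}\ \mathrm{J\,s}}
           {2 \times \left(9.1 \times 10^{-31}\ \mathrm{kg}\right) \times \left(10^{-10}\ \mathrm{m}\right)}
\;\approx\; 5.8 \times 10^{5}\ \mathrm{m/s}
```

This is of the order of $10^6$ m/s, comparable with the velocities of the atomic electrons themselves.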

Another pair of conjugate variables is energy $E$ and time $t$ for which the
uncertainty relation is

$$\sigma_E \cdot \sigma_t \ge \frac{\hbar}{2} . \tag{11.87}$$

People usually think of a vacuum as being totally devoid of matter and
energy. But such slipshod thinking is forbidden by the uncertainty relation:
if the energy is precisely zero, then so is its fluctuation, but
$\sigma_E = 0$ would violate (11.87). Thus, even the vacuum has energy - with
numerous testable consequences. In fact, our entire universe may have started
from a vacuum fluctuation run amok. As Thomas Cranmer, archbishop of
Canterbury, put it so aptly in 1520, “Natural reason abhorreth vacuum.” (And
as John Pierce, who gave the transistor its name, remarked 428 years later,
“Nature abhors vacuum tubes.”)

One of the most accurate phenomena in all of physics is the Mössbauer
effect in which electromagnetic radiation in the form of gamma quanta is 
emitted with a relative energy uncertainty smaller than 10~^^. (This corre- 
sponds to a Q-factor, familiar to electrical engineers, exceeding 10^^.) This 
small energy uncertainty leads to a very large time uncertainty and therefore 
to a very large duration of the associated Schrödinger wave, namely more
than 10^^ periods! Of course, if this wave is chopped into shorter pieces by 
a kind of camera shutter in the course of a Mössbauer experiment, the
energy uncertainty will increase according to (11.87). This is an inescapable
conclusion which, however, even some Mössbauer experts refused to believe
because, they argued, “a laterally moving shutter cannot possibly change the
energy of the quanta passing through it”. Well, as Richard Feynman once
said, nobody understands quantum mechanics (including, it seems, quantum
physicists).

11.11.1 Prolate Spheroidal Wave Functions and Uncertainty 

Instead of expressing uncertainties in terms of standard deviations, as
Heisenberg did, other circumstances require different criteria. For example,
an engineer may want to design a pulse shape or time window in such a way
that for a given total duration of the window, a maximum amount of the total
energy should fall within a given bandwidth. In other words, the spectral
energy “splatter” should be kept at a minimum. What shape is the optimum
window? The answer, given by David Slepian, Henry Landau and Henry Pollak,
is prolate spheroidal wave functions [11.27]. These functions have the
remarkable property that they are orthogonal to each other both over an
infinite and a finite range of the independent variable. And, like the
Gaussian distribution, each prolate spheroidal wave function $\psi_n(t)$ is
proportional to its own Fourier transform, except that the proportionality
extends only to a finite interval. Thus,

$$\tilde\psi_n(\omega) \propto \psi_n\!\left(\frac{T}{B}\,\omega\right) \quad\text{for } |\omega| < B . \tag{11.88}$$

Here $T$ is half the total duration and $B$ is a bandwidth.

Prolate spheroidal wave functions, as the name implies, are solutions of
the wave equation in prolate spheroidal coordinates. A prolate spheroid
results from an ellipse rotated about its long axis. An airborne blimp is an
approximate prolate spheroid. (When an ellipse is rotated about its short
axis, an oblate spheroid results. The surface of the earth is an approximate
oblate spheroid, that is, a sphere flattened at the poles and bulging at the
equator due to centrifugal forces.)

The basic properties of the prolate spheroidal wave functions follow
from the fact that they are solutions of the integral equation

$$\lambda\,\psi(t) = \int_{-T}^{T} \frac{\sin B(t - x)}{\pi(t - x)}\,\psi(x)\,dx \tag{11.89}$$

for certain values of $\lambda = \lambda_n$, called eigenvalues. There are
infinitely many eigenvalues, all real, positive and smaller than 1:

$$1 > \lambda_0 > \lambda_1 > \cdots > \lambda_n > \cdots > 0 . \tag{11.90}$$

The corresponding eigenfunctions $\psi_n(x)$ form a complete orthonormal set
on the interval $(-\infty, \infty)$:

$$\int_{-\infty}^{\infty} \psi_n(x)\,\psi_m(x)\,dx = \delta_{nm} , \tag{11.91}$$

where $\delta_{nm}$ is Kronecker’s delta symbol defined as 1 for $n = m$ and
0 otherwise. By changing the time scale, the limits of integration $\pm T$ in
(11.89) can be made equal to $\pm 1$, the limits used in the standard
definition. The parameter $B$ in (11.89) then changes to $BT$. Thus, the
prolate spheroidals depend only on one parameter, the time-bandwidth product
$BT$.
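The eigenvalues of the integral equation with the $\sin B(t-x)/\pi(t-x)$ kernel can be approximated by discretizing the integral. The sketch below uses a simple midpoint rule on the interval $(-T, T)$ with an assumed time-bandwidth product $BT = 4$ and an arbitrary grid size; it is a numerical illustration only, not the method of [11.27].

```python
import numpy as np

T, B = 1.0, 4.0                 # half-duration and bandwidth (assumed), BT = 4
N = 400                         # number of quadrature points (assumed)
dx = 2 * T / N
x = -T + dx * (np.arange(N) + 0.5)          # midpoints on (-T, T)

# Kernel sin(B d) / (pi d); note np.sinc(u) = sin(pi u) / (pi u).
d = x[:, None] - x[None, :]
kernel = (B / np.pi) * np.sinc(B * d / np.pi)

# Midpoint-rule discretization of the integral operator; the matrix is
# real and symmetric, so eigh applies.
lam, vec = np.linalg.eigh(kernel * dx)
lam = lam[::-1]                              # sort in descending order

print(lam[:5])
print(lam.sum(), 2 * T * B / np.pi)          # trace check
```

Only about $2BT/\pi$ eigenvalues are close to 1; the remaining ones fall off rapidly toward 0, which is why the time-bandwidth product alone decides how well energy can be concentrated.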

Equation (11.89) has the form of a convolution integral. The convolution
kernel $\sin(Bt)/\pi t$ represents a sharp lowpass filtering process in the frequency
domain. Thus, $\psi(t)$ in (11.89) is a lowpass function with no energy at angular
frequencies outside the interval $(-B, B)$. Because the $\psi_n(t)$ form a complete
set of orthonormal functions, bandlimited functions $s(t)$ with the same
bandwidth can be expanded in terms of the $\psi_n(t)$:

$$s(t) = \sum_{n=0}^{\infty} a_n\,\psi_n(t) , \tag{11.92}$$

where

$$a_n = \int_{-\infty}^{\infty} s(t)\,\psi_n(t)\,dt . \tag{11.93}$$
The integral equation (11.89) has an interesting interpretation. It says that
gating $\psi_n(t)$ at $\pm T$ with a rectangular window will reproduce $\psi_n(t)$ (except
for a factor $\lambda_n$) when the spectral “splatter” resulting from the gating is
removed (by lowpass filtering).

Considering the $\psi_n$ as functions of frequency, $\psi_n(\omega)$, their inverse Fourier
transforms are time-limited. Any bandlimiting of a $\psi_n(\omega)$ will of course cause
“time splatter,” but within the original time interval the inverse Fourier
transform will be reproduced, except for an attenuation by a factor $\lambda_n$.

The most important property of the $\psi_n(t)$ is that they are not only
orthogonal in the infinite interval $(-\infty, \infty)$ but also in the finite interval $(-T, T)$.
In fact,

$$\int_{-T}^{T} \psi_n(t)\,\psi_m(t)\,dt = \lambda_n \delta_{nm} . \tag{11.94}$$

This orthogonality can be deduced from (11.89) by setting $\psi(t) = \psi_n(t)$ and
$\lambda = \lambda_n$, multiplying by $\psi_m(t)$ and integrating over all $t$. Changing the order
of integration gives, with (11.91),

$$\int_{-T}^{T} \psi_n(x) \int_{-\infty}^{\infty} \psi_m(t)\,\frac{\sin B(t-x)}{\pi\,(t-x)}\,dt\,dx = \lambda_n \delta_{nm} . \tag{11.95}$$

Here the integral over $t$ equals $\psi_m(x)$ (because bandlimiting the bandlimited
function $\psi_m(t)$ simply reproduces it) and (11.95) is thus seen to be identical
with (11.94).

The main application of the prolate spheroidal wave functions is the
design of bandlimited signals with a maximum energy fraction in a given time
interval. Given a bandlimited signal $s(t)$, we can expand it into the properly
scaled $\psi_n(t)$, see (11.92). The total energy $E$ of the signal equals

$$E = \int_{-\infty}^{\infty} s^2(t)\,dt = \sum_{n=0}^{\infty} a_n^2 .$$

The fraction $E_T$ falling into the time interval $(-T, T)$ follows from (11.94) as

$$E_T := \int_{-T}^{T} s^2(t)\,dt = \sum_{n=0}^{\infty} \lambda_n a_n^2 .$$

Thus, the energy fraction $\alpha := E_T/E$ is given by

$$\alpha = \frac{\sum_{n=0}^{\infty} \lambda_n a_n^2}{\sum_{n=0}^{\infty} a_n^2} .$$

Since $\lambda_0$ is larger than any other $\lambda_n$, $\alpha$ is maximized by setting all $a_n$ except
$a_0$ equal to 0. Thus,

$$\alpha_{\max} = \lambda_0 ,$$

where $\lambda_0$ depends on the time-bandwidth product $TB$. For example, for
$TB = 1$, $\alpha \approx 0.6$. If $\alpha$ is required to be as high as 0.95, then $TB$ must be
about 3 [11.28].
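The dependence of $\lambda_0$ on the time-bandwidth product can be checked numerically. The following is a minimal sketch, not the method of [11.28]: it assumes the time axis has been rescaled so that the limits in (11.89) are $\pm 1$ and the kernel bandwidth is $BT$, replaces the integral by a midpoint rule, and finds the largest eigenvalue by power iteration. The function name and the discretization parameters are illustrative choices.

```python
import math


def largest_prolate_eigenvalue(bt, n=100, iterations=100):
    """Estimate lambda_0 of the integral equation (11.89).

    Assumption: the time axis is scaled so that the limits are +-1 and the
    kernel is sin(BT*(t-x)) / (pi*(t-x)). The integral is approximated by a
    midpoint rule on n nodes; the dominant eigenvalue of the resulting
    symmetric matrix is found by power iteration.
    """
    h = 2.0 / n                                   # width of one midpoint cell
    nodes = [-1.0 + (i + 0.5) * h for i in range(n)]

    def kernel(d):
        # sin(BT*d) / (pi*d), with its limit BT/pi at d = 0
        return bt / math.pi if d == 0.0 else math.sin(bt * d) / (math.pi * d)

    matrix = [[kernel(t - x) * h for x in nodes] for t in nodes]

    v = [1.0] * n            # even, positive start vector (psi_0 is even)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        # Rayleigh quotient as the current eigenvalue estimate
        lam = sum(wi * vi for wi, vi in zip(w, v)) / sum(vi * vi for vi in v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lam
```

Calling `largest_prolate_eigenvalue(1.0)` and `largest_prolate_eigenvalue(3.0)` gives the energy fractions for the two time-bandwidth products mentioned in the text; finer grids (larger `n`) refine the estimate at the cost of run time.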
Similarly, if we wish to maximize the energy fraction of a time-limited
signal $s(t)$ in the frequency range $(-B, B)$, we expand $s(t)$ in terms of the
Fourier transforms $\tilde\psi_n$ of the $\psi_n$ which, properly scaled, vanish outside
$-T < t < T$:

$$s(t) = \sum_{n=0}^{\infty} b_n\,\tilde\psi_n(t) .$$

The energy $E_B$ inside a given frequency interval $(-B, B)$ is given by

$$E_B = \sum_{n=0}^{\infty} \lambda_n b_n^2$$

and the energy fraction $\beta = E_B/E$ equals

$$\beta = \frac{\sum_{n=0}^{\infty} \lambda_n b_n^2}{\sum_{n=0}^{\infty} b_n^2} .$$

As before, the energy fraction is maximized by setting all $b_n$, except $b_0$,
equal to 0. Thus, the optimum energy fraction $\beta_{\max}$ in the frequency band
$-B < \omega < B$ equals $\lambda_0$.

How close does the famous Hamming window

$$s(t) = 0.54 + 0.46 \cos\frac{\pi t}{T} \quad \text{for } |t| < T , \qquad s(t) = 0 \quad \text{otherwise} \tag{11.96}$$

come to optimizing the energy fraction in the frequency range $|\omega| < \pi/T$?

In some problems, limits are imposed on both $\alpha$ and $\beta$. In such cases
the optimum pulse shape is a linear combination of $\psi_0(t)$ and a time-gated
$\psi_0(t)$. Given the time-bandwidth product $TB$ and thus the eigenvalue $\lambda_0$,
a noteworthy trading relation between $\alpha$ and $\beta$ rules the realizable energy
concentrations:

$$\arccos\sqrt{\alpha} + \arccos\sqrt{\beta} = \arccos\sqrt{\lambda_0} .$$

The largest possible product of energy fractions, $\alpha\beta$, is obtained for $\alpha = \beta$,
resulting in

$$(\alpha\beta)_{\max} = \left(\frac{1 + \sqrt{\lambda_0}}{2}\right)^2 .$$

For $\lambda_0 = 0.6$ ($BT \approx 1$), this gives $(\alpha\beta)_{\max} = 0.787\ldots$, a remarkable degree
of energy concentration. By comparison, for the Gaussian window, setting
$T = \sigma_t$ and $B = \sigma_\omega = 1/2\sigma_t$, the product $(\alpha\beta)$ is only about 0.466.
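The two numbers quoted above can be reproduced with a few lines of arithmetic. The sketch below evaluates the trading relation for $\alpha = \beta$ and, for the Gaussian, assumes that each energy fraction is the energy lying within one standard deviation of a Gaussian energy density (in time and in frequency). That reading of the Gaussian comparison is an assumption; the function names are illustrative.

```python
import math


def max_product_of_energy_fractions(lambda0):
    """Largest alpha*beta allowed by the trading relation
    arccos(sqrt(alpha)) + arccos(sqrt(beta)) = arccos(sqrt(lambda0)),
    which is reached for alpha = beta."""
    half_angle = 0.5 * math.acos(math.sqrt(lambda0))
    alpha = math.cos(half_angle) ** 2     # equals (1 + sqrt(lambda0)) / 2
    return alpha * alpha


def gaussian_product_within_one_sigma():
    """alpha*beta for a Gaussian pulse when each fraction is the energy
    within one standard deviation of the (Gaussian) energy density.
    Assumption: this is how T and B are chosen in the comparison."""
    fraction = math.erf(1.0 / math.sqrt(2.0))   # about 0.683
    return fraction * fraction
```

With `lambda0 = 0.6` the first function returns the value 0.787 quoted in the text, and the second returns the Gaussian figure of about 0.466.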
11.12 Time and Frequency Windows

As we remarked before, the Gaussian pulse shape of infinite length is
optimum for minimizing the time-bandwidth product if measured by standard
deviations. But signals in the real world have finite durations and a gated
Gaussian is often not a good choice. Also, any discontinuity in a signal
results in a power spectrum proportional to $\omega^{-2}$ for large $\omega$. This means that
the standard deviation, which is based on the second moment of the power
spectrum, does not exist. Thus, other criteria for designing pulse shapes and
windows are needed.

One such criterion is convenience, and the most convenient procedure is
simply to limit the signal to a finite window interval $(-T, T)$ by gating with
a rectangular function:

$$s_r(t) = \begin{cases} 1 & \text{for } |t| < T \\ 0 & \text{otherwise} . \end{cases}$$

The corresponding Fourier transform

$$S_r(\omega) = 2\,\frac{\sin \omega T}{\omega}$$

has so much spectral energy splatter (the second moment of the energy
spectrum is infinite!) that the rectangular window is unacceptable in many
applications.

More generally, any discontinuity in the signal has a Fourier transform
that, asymptotically, is proportional to $1/\omega$ and thus decays relatively slowly
with increasing frequency (namely, with 6 dB/octave). This follows from the
fact that a discontinuous function, when differentiated, has delta functions
at the discontinuities which have flat Fourier transforms (independent of
frequency) and the inverse operation, integration, multiplies the Fourier
transform by $1/\mathrm{i}\omega$.

Similarly, a continuous signal with a discontinuous first derivative has a
Fourier transform that, asymptotically, is proportional to $1/\omega^2$. In general,
a continuous signal that has continuous derivatives up to and including the
$n$th derivative has a Fourier transform that, asymptotically, is proportional
to $1/\omega^{n+2}$. This law also applies to discontinuous functions whose integral is
continuous if we count an integration as a “negative” differentiation ($n = -1$).

Another simple shape is the triangular window

$$s_\Delta(t) = \begin{cases} T - |t| & \text{for } |t| < T \\ 0 & \text{otherwise} . \end{cases}$$

Because a triangular window of length $2T$ is the convolution of two
rectangular windows of length $T$, its Fourier transform is the square of that of such
a rectangular window:

$$S_\Delta(\omega) = 4\,\frac{\sin^2(\omega T/2)}{\omega^2} .$$

For the triangular window the standard deviation of the energy spectrum
does exist and equals $\sigma_\omega = \sqrt{3}/T$. Together with the standard deviation
in the time domain, $\sigma_t = T/\sqrt{10}$, the uncertainty product for the energy
equals $\sigma_t \sigma_\omega = \sqrt{3/10} \approx 0.548$, which compares quite well with the energy
uncertainty product 0.5 for a Gaussian window.
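The two standard deviations of the triangular window can be verified by direct numerical integration. The sketch below uses a midpoint rule in the time domain only; the spectral second moment is obtained from Parseval's theorem, which equates the second moment of the energy spectrum (normalized by the total energy) with the energy of the time derivative divided by the energy of the signal. The function name and the number of integration cells are illustrative.

```python
import math


def triangular_window_deviations(T=1.0, n=20000):
    """Return (sigma_t, sigma_omega) for the window s(t) = T - |t|, |t| < T.

    sigma_t is the standard deviation of the energy density s^2(t).
    sigma_omega is that of the energy spectrum, computed via Parseval:
    sigma_omega^2 = integral |ds/dt|^2 dt / integral s^2 dt.
    """
    h = 2.0 * T / n
    energy = 0.0
    second_moment = 0.0
    slope_energy = 0.0
    for i in range(n):
        t = -T + (i + 0.5) * h           # midpoint of cell i
        s = T - abs(t)
        energy += s * s * h
        second_moment += t * t * s * s * h
        slope_energy += 1.0 * h          # |ds/dt| = 1 inside the window
    sigma_t = math.sqrt(second_moment / energy)
    sigma_omega = math.sqrt(slope_energy / energy)
    return sigma_t, sigma_omega
```

The product of the two returned values is independent of $T$ and agrees with $\sqrt{3/10}$.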

Still another simple window is the raised-cosine window called “Hanning”
(after Julius von Hann, 1839-1921) by J. W. Tukey [11.29] and which should
perhaps be called Tukey window in honor of the statistician who gave us
such enduring (and endearing) terms as bit, byte, cepstrum and quefrency.
The Hanning/Tukey window

$$s_{\mathrm{T}}(t) = \begin{cases} 0.5 + 0.5\cos(\pi t/T) & \text{for } |t| < T \\ 0 & \text{otherwise} \end{cases}$$

has a Fourier transform

$$S_{\mathrm{T}}(\omega) = \frac{\pi^2 \sin(\omega T)}{\omega\,(\pi^2 - \omega^2 T^2)}$$

with a standard deviation of its energy spectrum $\sigma_\omega = \pi/(\sqrt{3}\,T)$.

Another well-known window was designed by R. W. Hamming and is
now called the Hamming window, see (11.96). Hamming added a “pedestal” to
the raised-cosine window to minimize the magnitude of the largest spectral
sidelobe.

The Fourier transform of the Hamming window,

$$S_{\mathrm{H}}(\omega) = \frac{(1.08\,\pi^2 - 0.16\,T^2\omega^2)\,\sin(\omega T)}{\omega\,(\pi^2 - T^2\omega^2)} ,$$

has an infinite second spectral moment (because of the discontinuities at
$|t| = T$), but its largest sidelobe is down by about 43 dB from the main
spectral peak (as opposed to 31 dB for the Hanning/Tukey window).
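The sidelobe levels can be checked by scanning the two closed-form transforms beyond the edge of the main lobe. The sketch below works with the dimensionless variable $x = \omega T$ and with the transforms divided by $T$, for which the main lobe of both windows ends at $x = 2\pi$. The scan range and step count are arbitrary illustrative choices.

```python
import math


def hann_spectrum(x):
    """Hanning/Tukey transform divided by T, as a function of x = omega*T."""
    return math.pi ** 2 * math.sin(x) / (x * (math.pi ** 2 - x * x))


def hamming_spectrum(x):
    """Hamming transform divided by T, as a function of x = omega*T."""
    return ((1.08 * math.pi ** 2 - 0.16 * x * x) * math.sin(x)
            / (x * (math.pi ** 2 - x * x)))


def peak_sidelobe_db(spectrum, dc_value, x_start=2.0 * math.pi,
                     x_stop=40.0 * math.pi, steps=200000):
    """Largest |spectrum| beyond the main lobe, in dB relative to the
    value at zero frequency (dc_value)."""
    peak = 0.0
    for i in range(1, steps + 1):
        x = x_start + (x_stop - x_start) * i / steps
        peak = max(peak, abs(spectrum(x)))
    return 20.0 * math.log10(peak / dc_value)


# Values at zero frequency: 1 for Hanning/Tukey, 1.08 for Hamming.
hann_db = peak_sidelobe_db(hann_spectrum, 1.0)
hamming_db = peak_sidelobe_db(hamming_spectrum, 1.08)
```

The two results can be compared directly with the sidelobe figures quoted above; the Hamming pedestal lowers the largest sidelobe by more than 10 dB relative to the raised cosine.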



11.13 The Wigner-Ville Distribution 

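One way to tackle the time-frequency resolution problem is the Wigner-Ville
distribution (WVD) of a signal $s(t)$:

$$W(t, \omega) = \int_{-\infty}^{\infty} s^*\!\left(t - \frac{\tau}{2}\right) s\!\left(t + \frac{\tau}{2}\right) \exp(-\mathrm{i}\omega\tau)\,d\tau ,$$

which depends both on time and frequency [11.30].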

The WVD was originally proposed around 1932 by the Hungarian physi- 
cist Eugene P. Wigner in connection with the new quantum mechanics and 
Heisenberg’s uncertainty relation [11.31]. It has the following highly desirable 
properties for signal and system analyses: 

Integrating $W(t,\omega)$ over frequency gives the signal energy density as a
function of time. Integrating $W(t,\omega)$ over time gives the energy density as a
function of frequency. Thus, $W(t,\omega)$ contains both the full, unsmoothed time
and frequency information. It is a kind of precursor to the Fourier
transformation in that a simple projection onto the frequency axis (integrating over
time) yields the spectral energy distribution. In fact, $W(t,\omega)$ can be
interpreted as an energy distribution in time and frequency - somewhat like a
running spectrogram (“voice print”), yet without the application of a
window function. The Fourier transformation of $W(t,\omega)$ with respect to $t$ yields
(to within a constant factor) the Fourier transform of the signal. Similarly,
the Fourier transform of $W(t,\omega)$ with respect to $\omega$ recovers the signal (to
within a constant factor). Thus, the WVD does not discard any
information; both the signal and its Fourier transform can be completely recovered.
It is interesting to note that the WVD can also be defined by an integral
over frequency (instead of time) in which the signal is replaced by its Fourier
transform. Thus, the WVD “plays no favorites;” it treats time and frequency
on an equal footing [11.32].

Other helpful properties of the WVD are that if the signal $s(t)$ has
bounded support, i.e. vanishes outside a finite time interval, then so does
the WVD (for the same time interval). Similarly, if the Fourier transform of
the signal has bounded support in frequency space, so does the WVD (for the
same frequency interval). Hence, the WVD does not cause spectral or
temporal “splatter.” If the signal is the product of two time functions, its WVD
is the convolution of the two original WVDs with respect to the frequency
variable. If the signal is a convolution of two time functions, its WVD is a
convolution with respect to its time variable. The time course of instantaneous
frequency, $\dot\varphi(t)$, of a signal $a(t)\exp(\mathrm{i}\varphi(t))$ is given by the mean frequency
$\bar\omega$ obtained with $W(t,\omega)$ as a weighting function. The frequency-dependent
centroid of the signal, obtained with $W(t,\omega)$ as a weighting function, is
given by the (negative) derivative of the Fourier phase with respect to
frequency. The WVD has
been widely used in system analysis, especially loudspeaker responses [11.33].
Various discrete-time forms of the WVD are discussed in [11.34].
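The marginal property (integration over frequency yields the energy density in time) is easy to check on a sampled signal. The following is one simple discrete-time form, chosen here for illustration and not necessarily one of those discussed in [11.34]: the lag $\tau$ is restricted to even multiples of the sampling interval, $\tau = 2m$, and the lag range at each instant is limited by the signal boundaries. The function name and the number of frequency bins are assumptions of this sketch.

```python
import cmath
import math


def wigner_ville(signal):
    """Discrete Wigner-Ville distribution of a complex sequence.

    W[n][k] = sum_m s[n+m] * conj(s[n-m]) * exp(-2j*pi*k*m / K), K = 2*N.
    Because the lag is tau = 2m samples, bin k corresponds to the angular
    frequency omega = pi*k/K radians per sample.
    """
    n_samples = len(signal)
    n_freq = 2 * n_samples
    wvd = []
    for n in range(n_samples):
        max_lag = min(n, n_samples - 1 - n)      # stay inside the signal
        row = []
        for k in range(n_freq):
            acc = 0.0
            for m in range(-max_lag, max_lag + 1):
                product = signal[n + m] * signal[n - m].conjugate()
                phase = cmath.exp(-2j * math.pi * k * m / n_freq)
                # Terms for +m and -m are complex conjugates, so the sum
                # is real and only real parts need to be accumulated.
                acc += (product * phase).real
            row.append(acc)
        wvd.append(row)
    return wvd


# Example: a Gaussian-windowed linear chirp (illustrative test signal).
chirp = [cmath.exp(1j * 0.01 * n * n) * math.exp(-((n - 16) / 8.0) ** 2)
         for n in range(32)]
distribution = wigner_ville(chirp)
```

Summing a row of `distribution` over all frequency bins and dividing by the number of bins returns $|s(n)|^2$, the discrete counterpart of the time marginal.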

The WVD is closely related to another function of time and frequency.
It is in fact the Fourier transform of the “ambiguity function,” known from
radar technology. During World War II the Allied air forces used aluminum
foil strips (“chaff”), tuned to the wavelength of the enemy radar, to “swamp”
their monitors with false targets. Thereupon the enemy started using chirped
pulses in their radars, which - courtesy of the Doppler effect - allowed them
to distinguish fast moving objects (planes) from the stationary chaff.
However, chirp pulses introduce an undesirable ambiguity between target range
and target velocity. Thus, the search was on for radar signals with a low
ambiguity (which a chirp lacks). The ideal ambiguity function is the “thumb-tack”
function, a function that has only a single high peak in the time-frequency
plane. It is interesting to note that some of the best ambiguity functions are
supplied by number theory [11.13].
11.14 The Cepstrum: Measurement of Fundamental Frequency

One of the most difficult problems in speech analysis is the accurate measure- 
ment of the fundamental frequency, especially if the Fourier component at 
the fundamental frequency itself is missing, as is often the case in telephone 
signals. The problem is aggravated by the fact that speech synthesis requires 
an accurate reproduction of the natural fundamental frequency. While the 
human ear is somewhat tolerant to errors in the formant frequencies (and 
relatively insensitive to formant bandwidth), even minute deviations from 
the pitch contour of natural speech can be quite objectionable. Often syn- 
thetic pitch contours are too smooth compared to real speech. But intro- 
ducing randomness artificially into the pitch almost always reduces speech 
quality rather than improving it. Natural pitch does fluctuate, but it does 
so in a nonrandom manner resulting from subtle linear and nonlinear inter- 
actions between the vocal cords and the vocal tract. In speech compression 
systems that include a pitch parameter, accurate measurement of successive 
fundamental periods is therefore a prime requirement. 

Thus it is not surprising that a great many schemes for extracting pitch 
from running speech - even in the absence of the fundamental frequency 
component - have been proposed and tested. The simplest among these are 
the so-called peak pickers that determine the location of the prominent peaks 
of the signal or its envelope, hoping that they reflect successive pitch periods. 
But even if they do, formant-frequency shifts affect the location of these 
peaks within the fundamental period, leading to a “quavery” voice in the 
resynthesized speech signal. 

Another approach to fundamental-frequency measurement is the
short-time autocorrelation function (ACF) of the speech signal. For a periodic
signal, the ACF has a large maximum at a delay of one pitch period. But, 
unfortunately, there are secondary maxima in the ACF that reflect the for- 
mant frequencies. During rapid phoneme transitions, these formant peaks 
sometimes exceed in magnitude the pitch-period maximum, leading to size- 
able errors. It is in fact this interference from the formants of speech that 
bedevils most pitch extractors. But how can we suppress the formants to 
facilitate pitch extraction? 

At very high signal-to-noise ratios, inverse filtering would remove the
formant resonances and yield the glottal waveform, which is of course ideal for
pitch measurements.

At more realistic signal-to-noise ratios, spectrum flattening can reduce the 
influence of the formants on the spectrum. Spectrum flatteners were origi- 
nally proposed in connection with voice-excited vocoders (VEVs) to generate 
excitation signals for the high speech frequencies from a baseband by means 
of nonlinear distortion [11.35]. A very effective method of nonlinear
distortion for spectral flattening uses center clipping. (While peak clipping does
little harm to the formant structure, severe center clipping effectively removes
the formants.) Effective center clipping depends of course on a well-chosen
adaptive clipping threshold. Another method for emphasizing the peaks of a
signal, one that doesn’t require touchy threshold adjustments, is taking the
third (or fifth) power of the signal.

But the most effective method of removing the bothersome formant
structure from a speech spectrum is the so-called cepstrum method [11.36]. The
cepstrum, originally conceived to distinguish underground nuclear explosions
from earthquakes, is defined as the Fourier transform of the logarithm of the
power spectrum. Without the logarithm, one would obtain the autocorrelation
function with the formant-peak problems already mentioned. In the
cepstrum, by contrast, formant effects are almost completely removed from
the pitch representation. To see this, we model a voiced speech signal $s(t)$ as
a convolution of periodic excitation pulses $p(t)$ and the response $h(t)$ of the
vocal tract (including the glottal waveform and radiation from the lips):

$$s(t) = p(t) * h(t) .$$

Fourier transformation turns the noisome convolution (which “mixes up
the formant and pitch information”) into a simple product:

$$\tilde s(\omega) = \tilde p(\omega) \cdot \tilde h(\omega) . \tag{11.97}$$

Taking logarithms turns the simple product into an even simpler sum:

$$\log \tilde s(\omega) = \log \tilde p(\omega) + \log \tilde h(\omega) . \tag{11.98}$$

The effects of pitch, $p$, and vocal tract, $h$, are now simply additive and can
be clearly separated by a further Fourier transformation:

$$c(q) = \widehat{\log \tilde s}\,(q) = \widehat{\log \tilde p}\,(q) + \widehat{\log \tilde h}\,(q) , \tag{11.99}$$

where the wide hat denotes the further Fourier transformation and $q$ is the
new Fourier variable having the dimension of time and called
quefrency by Tukey (and almost everyone else). The function $c(q)$ is called
the complex cepstrum, which has found many applications in “homomorphic
filtering” and homomorphic vocoders [11.37,11.38]. If $\tilde s$ is replaced by its
magnitude, the regular cepstrum results.

Figure 11.1 shows the cepstra for an utterance containing voiced speech.
The left half of Fig. 11.1 shows a sequence of logarithmic spectra for the
utterance and the right half shows the corresponding cepstra. The sharp peaks
on the right, at high quefrencies, correspond to the lengths of the pitch
periods; they are caused by the high-quefrency “ripple” visible in the power spectra
of the voiced sound. The formants lead to a smooth frequency variation in
the power spectrum and show up near the origin of the cepstrum. Thus, pitch
and formant effects are completely separated by the cepstrum, at least for
voices whose pitch is not so high that the corresponding spectral ripple has
too low a quefrency.
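The separation of pitch and formant information can be illustrated on a synthetic signal. The sketch below is a toy example under stated assumptions, not the analysis used for Fig. 11.1: the “vowel” is an impulse train (period 80 samples at an assumed sampling rate of 8 kHz, i.e. 100 Hz) passed through a single two-pole resonance at 700 Hz, the frame is Hamming-weighted, and the cepstrum is taken as the magnitude of the Fourier transform of the logarithmic power spectrum. All names and parameter values are illustrative.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def cepstrum(frame):
    """Magnitude of the Fourier transform of the log power spectrum."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
              for i in range(n)]
    spectrum = fft([w * s for w, s in zip(window, frame)])
    # The small constant avoids log(0) at exact spectral zeros.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in spectrum]
    return [abs(c) / n for c in fft(log_power)]


def synthetic_vowel(n=1024, period=80, fs=8000.0,
                    formant_hz=700.0, bandwidth_hz=100.0):
    """Impulse train filtered by one two-pole resonance (toy vocal tract)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * formant_hz / fs
    y = [0.0] * n
    for i in range(n):
        x = 1.0 if i % period == 0 else 0.0
        y1 = y[i - 1] if i >= 1 else 0.0
        y2 = y[i - 2] if i >= 2 else 0.0
        y[i] = x + 2.0 * r * math.cos(theta) * y1 - r * r * y2
    return y


def estimate_pitch(frame, fs=8000.0, min_period_s=0.0025,
                   max_period_s=0.0175):
    """Fundamental frequency from the largest cepstral peak in the
    quefrency range that excludes the low-quefrency formant region."""
    c = cepstrum(frame)
    lo, hi = int(min_period_s * fs), int(max_period_s * fs)
    best = max(range(lo, hi + 1), key=lambda q: c[q])
    return fs / best


pitch_hz = estimate_pitch(synthetic_vowel())
```

The lower search limit keeps the peak picker away from the formant contribution near the origin of the cepstrum; for high-pitched voices this limit would have to be lowered, which is exactly the situation in which the separation becomes less clean.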

For an even better pitch detection method see Appendix B, Sect. B.7. 
Fig. 11.1. Left: a sequence of logarithmic short-time spectra of a partly voiced
utterance (/chase/). Note the harmonic ripple of the spectra during the voiced
portion (/a/). Right: sequence of cepstra. The cepstrum is defined as the magnitude
of the Fourier transform of the logarithmic power spectrum. Note the prominent
peaks around quefrencies of 9 ms, corresponding to fundamental frequencies near
110 Hz, during the voiced vowel sound. The low-quefrency clutter (below 5 ms)
represents formant and other information in the logarithmic power spectra that
change smoothly along the frequency axis. This is the information in the spectral
envelope (as opposed to the spectral fine structure represented by the harmonic
ripples)
11.15 Line Spectral Frequencies

While the cepstrum has proved very useful to characterize the spectral fine
structure of a periodic signal, the so-called “line spectral frequencies” are an
efficient means to represent spectral envelopes and filter responses.

Let $P(z)$ be the $z$-transform of an all-zero filter with all its zeros $z_k$ inside
the unit circle. For example, $P(z)$ could be the reciprocal of a stable all-pole
filter, such as a linear prediction filter. Such filters are usually characterized
by the locations of their complex zeros $z_k$ inside the unit circle - angles
corresponding to the frequencies of the filter’s resonances and magnitudes
to their bandwidths. To quantize these locations efficiently has sometimes
proved difficult.

As an alternative we introduce a new concept, the line spectral
frequencies [11.39]. They are defined as follows. First, we introduce the reciprocal
polynomial of $P(z)$:

$$\bar P(z) := z^{-n}\,P\!\left(\frac{1}{z}\right) , \tag{11.100}$$

where $n$ is the degree of $P(z)$. It follows that

$$L(z) := \frac{P(z)}{\bar P(z)} \tag{11.101}$$

has an allpass response, i.e. $|L(e^{\mathrm{i}\omega})| = 1$.

The symmetric polynomial

$$A(z) := \tfrac{1}{2}\bigl(P(z) + \bar P(z)\bigr) \tag{11.102}$$

has zeros for $L(z) = -1$. On the unit circle, $z = e^{\mathrm{i}\omega}$, we have

$$L(z) = e^{\mathrm{i}\varphi(\omega)} ,$$

where $\varphi(\omega)$ is twice the phase of $P(z)$ (to within a phase term linear in $\omega$).
Thus, the zeros of $A(z)$ occur for

$$\varphi(\omega_k) = 2\pi k + \pi , \quad k = 0, 1, \ldots, n-1 \tag{11.103}$$

and there are exactly $n$ such frequencies in the interval $0 < \omega < 2\pi$. (This is
not necessarily true if some zeros of $P(z)$ fall outside the unit circle.) The
frequencies $\omega_k$ are called line spectral frequencies.

Similarly, for the antisymmetric polynomial

$$B(z) := \tfrac{1}{2}\bigl(P(z) - \bar P(z)\bigr) \tag{11.104}$$

the zeros occur for

$$\varphi(\omega_m) = 2\pi m , \quad m = 0, 1, \ldots, n-1 . \tag{11.105}$$

Thus, the zeros of $A(z)$ and $B(z)$ fall on the unit circle and are interleaved
with each other.

The line spectral frequencies are not necessarily close to the zeros of P{z) 
and they are easier to quantize than predictor coefficients. 
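The interleaving of the zeros of $A(z)$ and $B(z)$ can be observed numerically. The sketch below assumes real filter coefficients and the degree-$n$ reciprocal polynomial, whose coefficients are those of $P(z)$ in reversed order. On the unit circle, $A(z)\,z^{n/2}$ is then real and $B(z)\,z^{n/2}$ is purely imaginary, so the zeros can be located as sign changes of two real trigonometric sums followed by bisection. The example zeros, the grid size and all function names are illustrative choices.

```python
import cmath
import math


def polynomial_from_zeros(zeros):
    """Real coefficients a[0..n] of P(z) = sum_k a[k] z^-k
    = prod_k (1 - z_k z^-1); zeros must come in conjugate pairs."""
    coeffs = [1.0 + 0j]
    for zk in zeros:
        shifted = [0j] + [-zk * c for c in coeffs]
        coeffs = [a + b for a, b in zip(coeffs + [0j], shifted)]
    return [c.real for c in coeffs]


def line_spectral_frequencies(a, steps=4000):
    """Angles (radians) of the unit-circle zeros of the symmetric and
    antisymmetric polynomials built from P and its reciprocal polynomial."""
    n = len(a) - 1
    rev = a[::-1]                        # coefficients of the reciprocal polynomial
    sym = [(x + y) / 2.0 for x, y in zip(a, rev)]
    anti = [(x - y) / 2.0 for x, y in zip(a, rev)]

    def f_sym(w):                        # A(e^{iw}) e^{i n w / 2}, real
        return sum(c * math.cos((n / 2.0 - k) * w) for k, c in enumerate(sym))

    def f_anti(w):                       # B(e^{iw}) e^{i n w / 2} / i, real
        return sum(c * math.sin((n / 2.0 - k) * w) for k, c in enumerate(anti))

    def zero_crossings(f):
        start = -1e-3                    # small offset so a zero at w = 0 is bracketed
        h = 2.0 * math.pi / steps
        roots = []
        x0, f0 = start, f(start)
        for i in range(1, steps + 1):
            x1 = start + i * h
            f1 = f(x1)
            if f0 * f1 < 0.0:            # sign change: refine by bisection
                lo, hi, f_lo = x0, x1, f0
                for _ in range(60):
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0.0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                roots.append(0.5 * (lo + hi))
            x0, f0 = x1, f1
        return roots

    return zero_crossings(f_sym), zero_crossings(f_anti)


# Example filter with two resonances (zeros inside the unit circle).
example_zeros = [0.9 * cmath.exp(0.5j), 0.9 * cmath.exp(-0.5j),
                 0.8 * cmath.exp(1.5j), 0.8 * cmath.exp(-1.5j)]
coefficients = polynomial_from_zeros(example_zeros)
lsf_a, lsf_b = line_spectral_frequencies(coefficients)
```

For this fourth-degree example each polynomial contributes four frequencies in one full turn of the unit circle, and sorting the two lists together shows the alternation between zeros of $A(z)$ and zeros of $B(z)$.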




A. Acoustic Theory and Modeling of the Vocal Tract

by H.W. Strube, Drittes Physikalisches Institut, Universität Göttingen

A.1 Introduction

This appendix is intended for those readers who want to inform themselves 
about the mathematical treatment of the vocal-tract acoustics and about its 
modeling in the time and frequency domains. Apart from providing a funda- 
mental understanding, this is required for all applications and investigations 
concerned with the relationship between geometric and acoustic properties 
of the vocal tract, such as articulatory synthesis, determination of the tract 
shape from acoustic quantities, inverse filtering, etc. 

Historically, the formants of speech were conjectured to be resonances of 
cavities in the vocal tract. In the case of a narrow constriction at or near 
the lips, such as for the vowel [u], the volume of the tract can be considered 
a Helmholtz resonator (the glottis is assumed almost closed). However, this 
can only explain the first formant. Also, the constriction - if any - is usu- 
ally situated farther back. Then the tract may be roughly approximated as a 
cascade of two resonators, accounting for two formants. But all these
approximations by discrete cavities proved unrealistic. Thus researchers have
now adopted a more reasonable description of the vocal tract as a nonuniform
acoustical transmission line. This can explain an infinite number of
resonances, of which, however, only the first 2 to 4 are of phonetic importance.
Depending on the kind of sound, the tube system has different topology: 

- for vowel-like sounds, pharynx and mouth form one tube; 

- for nasalized vowels, the tube is branched, with transmission from pharynx 
through mouth and nose; 

- for nasal consonants, transmission is through pharynx and nose, with the 
closed mouth tract as a “shunt” line. 

The situation becomes even more complicated for plosive and fricative 
consonants, where one must further take into account different places and 
kinds of excitation. Instead of - or in addition to - glottal oscillation, there 
is a turbulent-noise source at any narrow constriction, and for plosives, a 
sudden pressure release after opening a closure. 
In this appendix, we will present a fundamental description of a
one-dimensional tube in the time and frequency domains, show the connection
between tube shape and formants, and present methods for time-domain 
modeling of sound propagation as well as frequency-domain computation of 
transfer functions and impedances. The “inverse problem” of how to estimate 
the tract shape from acoustical data will also be discussed briefly. 



A.2 Acoustics of a Hard-Walled, Lossless Tube

To keep the formulas simple, we will present the fundamental equations for
the hard-walled, lossless tube only, but first allowing time-varying tube shape.
The more general case will only be described in the frequency domain for a 
time-invariant tube shape. In addition to the well-known representation by 
pressure and volume velocity, less familiar representations will be introduced 
that are useful for modeling and computation or show analogies to other fields 
of physics. 

A.2.1 Field Equations

The acoustic field equations are derived from the Navier-Stokes flow
equations by linearization, assuming that flow velocity is small compared to the
speed of sound, $c$. Furthermore, the non-zero average (dc) air flow in the vocal
tract is neglected (its effects on the formants would be of second order only).
But keep in mind that at narrow constrictions, nonlinear effects can become
important, e.g. when whistling or in the glottis. Additional approximations
are:

(1) The curved vocal tract is treated like a straight tube. 

(2) The waves propagate one-dimensionally along the tube axis x and are 
approximately plane. This requires that the slope of the tube walls be small. 

(3) No higher modes with nodes over the cross-section are taken into account
[they are removed by the integrals in (A.1) below]. In the vocal tract, higher
modes cannot propagate below about 4 kHz.

Thus the tube is entirely described by its “area function” $A(x, t)$. Let $y$, $z$
be the coordinates in the cross-section plane. The appropriate field quantities
are the alternating (ac) pressure averaged over the cross-section, $p(x,t)$, and
the volume velocity $q(x,t)$, defined as

$$p(x,t) = \frac{1}{A} \iint_A p_3(x,y,z,t)\,dy\,dz , \qquad q(x,t) = \iint_A v_x(x,y,z,t)\,dy\,dz . \tag{A.1}$$

Here $p_3$ is the three-dimensional ac pressure field and $v_x$ the $x$ component of
the velocity field. The ac density $\varrho(x, t)$ is defined analogously to $p(x, t)$ and
proportional to it (state equation); the constant average density will be
denoted by $\varrho_0$. The motion is then described by the Euler equation (“Newton’s
law”), the (one-dimensional) continuity equation, and the state equation:

$$\varrho_0\,(q/A)^{\cdot} = -p' , \tag{A.2}$$

$$(\varrho A)^{\cdot} + \varrho_0 \dot A = -\varrho_0\,q' , \tag{A.3}$$

$$p = c^2 \varrho . \tag{A.4}$$

Here, a dot denotes $\partial/\partial t$, a prime means $\partial/\partial x$. The constant $c$ in the
proportionality factor $c^2$ of (A.4) will in fact turn out to be the speed of sound
(phase velocity). The second term in (A.3) represents a flow source due to the
motion of the tube walls. Since in the vocal tract these move too slowly to
generate audible sound, this term will henceforth be neglected. Then the last
two equations can be combined into

$$(pA)^{\cdot} / \varrho_0 c^2 = -q' . \tag{A.5}$$

Obviously, the two field equations (A.2) and (A.5) are of a form analogous to
those of a lossless electrical transmission line, if $p$ is identified with voltage
and $q$ with current and

$$L' = \varrho_0 / A , \qquad C' = A / \varrho_0 c^2 \tag{A.6}$$

correspond to an inductance and capacitance density, respectively. These are
not independent, since $L'C' = c^{-2}$ is constant. Thus the tube may be
completely described by the (acoustic) characteristic impedance:

$$Z = \sqrt{L'/C'} = \varrho_0 c / A ; \tag{A.7}$$

then $L' = Z/c$, $C' = 1/Zc$, and for any derivative or variation “$\partial$” we have

$$\partial A/A = \partial C'/C' = -\partial L'/L' = -\partial Z/Z . \tag{A.8}$$
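The relations between the line parameters can be made concrete with numbers. In the sketch below the air density (1.2 kg/m³) and the speed of sound (350 m/s) are assumed round values, and the cross-sectional area is an arbitrary example; none of these figures are taken from the text.

```python
import math

RHO0 = 1.2        # average air density in kg/m^3 (assumed value)
C_SOUND = 350.0   # speed of sound in m/s (assumed value)


def line_parameters(area):
    """Inductance density L', capacitance density C' and characteristic
    impedance Z of a lossless tube with cross-sectional area `area` (m^2),
    following (A.6) and (A.7)."""
    inductance = RHO0 / area
    capacitance = area / (RHO0 * C_SOUND ** 2)
    impedance = math.sqrt(inductance / capacitance)
    return inductance, capacitance, impedance


# Example: a tube section of 3 cm^2 cross-section.
l_prime, c_prime, z = line_parameters(3.0e-4)
```

Whatever the area, the product `l_prime * c_prime` equals $c^{-2}$, which is why a single quantity, $Z$ or equivalently $A$, suffices to describe the tube.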

Equations (A.2) and (A.5) are rewritten as

$$-p' = (L'q)^{\cdot} = c^{-1}(qZ)^{\cdot} , \tag{A.9a}$$

$$-q' = (C'p)^{\cdot} = c^{-1}(p/Z)^{\cdot} . \tag{A.9b}$$

The infinitesimal transmission-line element is then that shown in Fig. A.1.
Note that in the time-varying case, $L'$ contains a quasi-resistive and $C'$ a
quasi-conductive component, since $(L'q)^{\cdot} = \dot L' q + L' \dot q$, etc.

Fig. A.1. Electrical equivalent circuit of the infinitesimal lossless
transmission-line element (elements $L'\,dx$ and $C'\,dx$)

The energy balance of the tube can be derived by multiplying (A.9a) and
(A.9b) with $q$ and $p$, respectively, and adding them, leading to the continuity
equation

$$(W_p + W_q)^{\cdot} + P' = (W_p - W_q)\,\dot Z/Z , \qquad W_p = C'p^2/2 , \quad W_q = L'q^2/2 , \quad P = pq , \tag{A.10}$$

where $W_p$ and $W_q$ are the potential and kinetic energy densities and $P$ is
the power (energy flow). The right-hand side represents an energy source
density due to work of the moving walls against the radiation pressure in the
tube. This term vanishes in the time-invariant case, so that energy is then
conserved.

Note that all equations are invariant under the duality transformation

$$p \leftrightarrow q , \qquad Z \leftrightarrow 1/Z . \tag{A.11}$$

Another familiar representation uses the velocity potential $\Phi$, related to
$p$ and $q$ according to

$$p = \varrho_0 \dot\Phi , \qquad q = -A\,\Phi' . \tag{A.12}$$

These equations imply (A.2) or (A.9a) automatically. Inserting them in (A.5)
now yields a second-order wave equation, namely Webster’s horn equation,
generalized for time-varying $A$:

$$c^{-2}(\dot\Phi A)^{\cdot} = (\Phi' A)' , \tag{A.13}$$

which has a completely symmetric form in the space and time derivatives.
Another similar form with $A$ replaced by $1/A$ - again corresponding to the
duality transformation (A.11) - can be obtained by using the volume
displacement $\int q\,dt$ instead of $\Phi$.

When $A$ and thus $L'$, $C'$, $Z$ are constant in time, Webster’s horn equation
can be written in its familiar form, not only for velocity potential and volume
displacement but also for pressure and volume velocity themselves:

$$p'' + (A'/A)\,p' - c^{-2}\ddot p = 0 , \tag{A.14}$$

$$q'' - (A'/A)\,q' - c^{-2}\ddot q = 0 . \tag{A.15}$$



Other Representations

a) Square root of energy density. By replacing $p$ and $q$ with the corresponding
square-root-of-energy-density components,

$$\psi = p\,\sqrt{C'/2} , \qquad \varphi = q\,\sqrt{L'/2} , \tag{A.16}$$

the field equations now read

$$c^{-1}(Z^{1/2}\varphi)^{\cdot} = -(Z^{1/2}\psi)' , \qquad c^{-1}(Z^{-1/2}\psi)^{\cdot} = -(Z^{-1/2}\varphi)' , \tag{A.17}$$

or equivalently,

$$c^{-1}(\dot\varphi + Q\varphi) = -\psi' - W\psi , \qquad c^{-1}(\dot\psi - Q\psi) = -\varphi' + W\varphi , \tag{A.18a}$$

$$\text{with} \quad Q = \dot Z/2Z = -\dot A/2A , \qquad W = Z'/2Z = -A'/2A . \tag{A.18b}$$

In the time-invariant case ($Q = 0$), this representation removes the first-order
derivatives of the fields from the Webster horn equations (A.14), (A.15),
yielding Schrödinger (or rather, Klein-Gordon) type equations instead:

$$-\psi'' + V_p\,\psi + c^{-2}\ddot\psi = 0 , \tag{A.19}$$

$$-\varphi'' + V_q\,\varphi + c^{-2}\ddot\varphi = 0 , \tag{A.20}$$

with potentials

$$V_p = (\sqrt{A})''/\sqrt{A} = W^2 - W' , \qquad V_q = (1/\sqrt{A})''\,\sqrt{A} = W^2 + W' . \tag{A.21}$$

This representation may be useful for eigenvalue problems and inverse
problems. In the quantum-mechanical terminology, $V_p$ and $V_q$ have the form of
supersymmetric partner potentials with $W$ as superpotential.

b) Wave quantities. As will become apparent in Sect. A.3 below, it is often
advantageous to transform $p$ and $q$ to an equivalent representation by “right”
and “left” traveling waves, whose dimension may be pressure, volume velocity,
or square root of power. With $Z$ from (A.7), these waves are defined as

$$p_\pm = (p \pm qZ)/2 , \tag{A.22}$$

$$q_\pm = (p/Z \pm q)/2 , \tag{A.23}$$

$$\psi_\pm = (p/\sqrt{Z} \pm q\sqrt{Z})/2 = (\psi \pm \varphi)\,\sqrt{c/2} . \tag{A.24}$$

The field equations for pressure and volume-velocity waves are, with $Q$ and
$W$ as in (A.18b),

$$c^{-1}\dot p_+ + p_+' = c^{-1}\dot p_- - p_-' = (c^{-1}Q - W)\,p_- + (c^{-1}Q + W)\,p_+ ; \tag{A.25}$$

$$c^{-1}\dot q_+ + q_+' = -c^{-1}\dot q_- + q_-' = (c^{-1}Q - W)\,q_- - (c^{-1}Q + W)\,q_+ ; \tag{A.26}$$

those for square-root-of-power waves are

$$c^{-1}\dot\psi_+ + \psi_+' = (c^{-1}Q - W)\,\psi_- , \qquad c^{-1}\dot\psi_- - \psi_-' = (c^{-1}Q + W)\,\psi_+ . \tag{A.27}$$

From this it is obvious that $W$ represents a reflection-factor density due to a
spatially changing $Z$. This is also clear since $Z'/2Z$ is the infinitesimal form
of the well-known expression $(Z_2 - Z_1)/(Z_2 + Z_1)$ for the reflection factor
at a discontinuity of the characteristic impedance. Time variation ($Q \ne 0$)
causes special additional reflections.
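The statement that $W$ is the infinitesimal form of the reflection factor can be checked with a short calculation. Since $Z = \varrho_0 c/A$, the factor $(Z_2 - Z_1)/(Z_2 + Z_1)$ depends only on the two areas. The sketch below compares it, for a very small area step, with $W\,dx = -A'\,dx/2A$; the area, slope and step width are arbitrary illustrative numbers.

```python
def reflection_factor(area_1, area_2):
    """(Z2 - Z1)/(Z2 + Z1) for a wave passing from area_1 into area_2.
    With Z = rho0*c/A the constant rho0*c cancels, leaving
    (A1 - A2)/(A1 + A2)."""
    z1, z2 = 1.0 / area_1, 1.0 / area_2
    return (z2 - z1) / (z2 + z1)


def reflection_density(area, d_area_dx):
    """W = Z'/(2Z) = -A'/(2A), see (A.18b)."""
    return -d_area_dx / (2.0 * area)


# A tube whose area grows with slope 0.5 (arbitrary units), sampled over
# a very short segment dx: the step reflection approaches W * dx.
area, slope, dx = 3.0, 0.5, 1.0e-6
step_reflection = reflection_factor(area, area + slope * dx)
density_times_dx = reflection_density(area, slope) * dx
```

A widening tube (positive slope) gives a negative reflection factor for the pressure wave, in line with the sign of $W = -A'/2A$.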

However, the reader should be warned that these waves have only a formal
meaning. They are not really unidirectionally traveling waves, except in the
constant, uniform tube ($Q = W = 0$). The idea behind this representation
is to consider the nonuniform tube as a sequence of infinitely short uniform
segments with infinitesimal reflections at their boundaries (similarly for
temporal changes), which will be explicitly used for discrete modeling in Sect.
A.3.

The energy balance can be written in terms of the powers $\psi_\pm^2$ of the right
and left traveling waves:

$$c^{-1}(\psi_+^2 + \psi_-^2)^{\cdot} + (\psi_+^2 - \psi_-^2)' = 4\,c^{-1}Q\,\psi_+\psi_- . \tag{A.28}$$

Thus the energy density is $w = c^{-1}(\psi_+^2 + \psi_-^2)$ and the power is $P = \psi_+^2 - \psi_-^2$.
The source term on the right-hand side again vanishes in the time-invariant
case.
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A. 2. 2 Time-Invariant Case 

The vocal- tract motions are rather slow compared with acoustic frequencies, 
\A/A\ <C |c<;|, except in the oscillating glottis and perhaps during plosive 
bursts. Thus the assumption of time-invariance is usually justified. Under 
this condition, the time derivatives of the field quantities can be eliminated, 
transforming all the partial differential equations into ordinary ones. This 
is achieved by Fourier transforming with respect to time. Since the field 
equations are linear, they hold for each Fourier component separately. For any 
field quantity /(x, t) the component with angular frequency uj is /(x, cu) 
i = , where /(x, cj) is the (complex) Fourier transform of /(x, t). (We are 

using the “engineering” convention here. Theoretical physicists prefer 
making all quantities their complex conjugates.) Thus, any time derivative is 
turned into a multiplication by ia;. For example, (A. 9a), (A. 9b) become 

—p' = iujL'q , —q' = iLoC'p . (A. 29) 

In the remainder of this section, we will always restrict ourselves to the
time-invariant case with $i\omega$ replacing time derivatives. To avoid
complication of the formulas, the field quantities will be written without
“hats” again, equivalent to a pure $e^{i\omega t}$ time dependence with fixed
$\omega$. Also, the argument $t$, not explicitly occurring any more, will be
omitted.

However, since the fields are now complex rather than real, nonlinear
expressions such as energies and powers must be reformulated. The energy
densities and the power from (A.10) are replaced by

$W_p = C'|p|^2/2 , \qquad W_q = L'|q|^2/2 , \qquad P = \mathrm{Re}\{p^* q\}$ ,    (A.30)

where the asterisk denotes the complex conjugate. For a single component,
all three are independent of time. Consequently, in the lossless case, $P$
is constant in space according to (A.10).

“Uniform” Waves. In uniform tubes, all solutions are linear combinations
of the two unidirectional traveling waves $\exp(i\omega t \mp ikx)$, with a
dispersion relation connecting $\omega$ and $k$ (here $|\omega/c| = |k|$). For
the square-root-of-energy-density fields $\psi$, $\varphi$ such waves - let us
call them “uniform” - even exist in some nonuniform tubes, since only $V_p$ or
$V_q$ in (A.19), (A.20) need be constant. For $\psi$ this is the case if
$\sqrt{A(x)}$ is constant or linear ($V_p = 0$) or of the form
$a e^{\alpha x} + b e^{-\alpha x}$ with $V_p = \alpha^2$, or
$a\cos(\alpha x) + b\sin(\alpha x)$ with $V_p = -\alpha^2$ ($a$, $b$, $\alpha$
real so that these expressions are $> 0$ in the considered interval). For
$\varphi$ the corresponding holds for $1/\sqrt{A(x)}$ and $V_q$. The dispersion
relation is

$k^2 = (\omega/c)^2 \mp \alpha^2$ ,    (A.31)

where ‘$-$’ holds with the exponential and ‘$+$’ with the trigonometric form.
For ‘$-$’ no wave-like propagation is possible if $|\omega/c| < |\alpha|$;
instead there is exponential decay $\exp(i\omega t \pm \kappa x)$,
$\kappa^2 = \alpha^2 - (\omega/c)^2$. Only for the constant or exponential area
functions do both $\psi$ and $\varphi$ have uniform-wave solutions, and then
$V_p = V_q$, $|\alpha| = |W| = |A'/2A|$.
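As a minimal numerical sketch of the dispersion relation (A.31), the following lines evaluate $k$ for the exponential and the trigonometric form and give the cutoff frequency of the exponential form. The function names, the speed of sound (340 m/s) and the example value of $\alpha$ are assumptions made for illustration only.

```python
import cmath
import math

C_SOUND = 340.0  # assumed speed of sound in m/s


def wavenumber(freq_hz, alpha, exponential=True, c=C_SOUND):
    """Return k from k^2 = (omega/c)^2 -/+ alpha^2 (minus sign: exponential form)."""
    omega = 2.0 * math.pi * freq_hz
    sign = -1.0 if exponential else 1.0
    # cmath.sqrt yields a purely imaginary k below cutoff (exponential decay).
    return cmath.sqrt((omega / c) ** 2 + sign * alpha ** 2)


def cutoff_frequency(alpha, c=C_SOUND):
    """Frequency below which the exponential form has no wave-like propagation."""
    return abs(alpha) * c / (2.0 * math.pi)
```

For an assumed $\alpha = 10\ \mathrm{m}^{-1}$, `cutoff_frequency(10.0)` is about 541 Hz; below that frequency `wavenumber` returns an imaginary value, above it a real one.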

As mentioned above, these uniform waves are not identical with the formal
waves defined earlier, except for the uniform tube. With another
transformation of the fields, this works, however, also for conical and
hyperbolic tubes.

Conical and Hyperbolic Tubes. For a conical tube ($\sqrt{A}$ linear),
$V_p = 0$; for a hyperbolic tube ($1/\sqrt{A}$ linear), $V_q = 0$. Define for
the first case

$\chi = \varphi + Wc\int\psi\,dt ; \qquad \psi_\pm = (\psi \pm \chi)/2$ .    (A.32)

Then these redefined wave quantities travel without dispersion and reflection.
At a junction between two conical tubes, even with continuous $A$, they
are scattered; however, the scattering occurs not in a memoryless way but
through first-order filters with exponential impulse response. If the change of
area slope is negative, the response is increasing, i.e., the filter is unstable.
Nevertheless, a chain of conical segments is stable due to cancelation of the
increasing responses (not very nice for numerics, of course).

The hyperbolic case can be obtained from this by the duality transformation,
$A \leftrightarrow 1/A$, $W \leftrightarrow -W$, $\psi \leftrightarrow \varphi$.

A.2.3 Formants as Eigenvalues

The formants can be defined as the resonances of the tube, given certain
boundary conditions at the glottis and at the lips. The glottal excitation
flow is disregarded, the treatment being based on free oscillations. The glottis
(situated at $x = 0$) is of rather high impedance compared to the tract, whereas
the lips (at $x = l$) have a relatively low radiation impedance, especially at
low frequencies. Thus the crudest approximation of the boundary conditions
would be $Z(0) = \infty$ (“hard”), equivalent to $q(0) = 0$ or $p'(0) = 0$ or
$\varphi(0) = 0$ or $\psi'(0) + W(0)\psi(0) = 0$; $Z(l) = 0$ (“soft”), meaning
$p(l) = 0$ or $q'(l) = 0$ or $\psi(l) = 0$ or $\varphi'(l) - W(l)\varphi(l) = 0$.
A somewhat better approximation would use an inductive lip termination,
$Z(l) = i\omega L_{\rm rad}$, representing the radiation mass load but still
neglecting the energy loss due to radiation. However, we do not consider losses
at this stage.

Under these conditions, our problem can be treated as an eigenvalue problem
of the Webster horn equations (A.14), (A.15)

$p'' + (A'/A)\,p' + (\omega/c)^2 p = 0$ ,    (A.33)

$q'' - (A'/A)\,q' + (\omega/c)^2 q = 0$ ,    (A.34)

or the corresponding Schrödinger equations (A.19), (A.20)

$-\psi'' + V_p\psi - (\omega/c)^2\psi = 0$ ,    (A.35)

$-\varphi'' + V_q\varphi - (\omega/c)^2\varphi = 0$ .    (A.36)



All these equations with their boundary conditions are of Sturm-Liouville
type. Their solutions fulfill the boundary conditions only for a discrete set of
$\omega$ values, which give us the formant frequencies. This was first done for
(A.14) by Ungeheuer [A.1] based on plaster casts of actual vocal tracts, using
the simplest boundary conditions above. The solutions were obtained iteratively
by perturbation methods, e.g., treating $(A'/A)$ as a small quantity.

Note that the above-mentioned inductive termination would work only for 
the volume- velocity equations (A. 34), (A. 36) in the form of Lrad^^ + = 0. 

For the pressure equations, the corresponding boundary condition would itself
depend on $\omega^2$, presenting a generalized eigenvalue problem. The same
holds for capacitive termination in the opposite case, which, however, plays no
role in the vocal tract.

In the case of ideal “hard” and “soft” termination, the eigenvalues depend
on $\ln A(x)$ only; there is no absolute area scale. But in the general case
this holds only if the terminating impedances are scaled in proportion to
$1/A$; otherwise they fix the scale.

Unfortunately the Schrödinger forms of the equations have very different
boundary conditions from those in quantum mechanics. If we digress from the
vocal tract for a while and assume the tube to be terminated equally hard or
soft at both ends, these equations show an interesting invariance with respect
to area-function transformations: $V_p$ is not changed if

$A \to A\left(\alpha\int_0^x \frac{dx'}{A(x')} + \beta\right)^{2}$    (A.37)

with $\alpha$, $\beta$ arbitrary but so that the parenthesized expression has no
zeros in $0 \le x \le l$. (Note that $\alpha$ has the dimension of a length.)
Consequently, if $p = 0$ at both ends, the eigenvalues and even the
potential-energy distribution $W_p$ remain unchanged! The corresponding holds
in the dual case for $V_q$ with $q = 0$ at both ends, and

$A \to A\left(\alpha\int_0^x A(x')\,dx' + \beta\right)^{-2}$ .    (A.38)

(Here $\alpha$ has the dimension of length$^{-3}$.) As special cases, a
soft-terminated conical tube ($\sqrt{A}$ = linear function of $x$) and a
hard-terminated hyperbolic tube ($1/\sqrt{A}$ = linear function of $x$) have
the same resonances as a uniform tube of equal length and termination. However,
the application of (A.37) to a hard termination changes this termination to an
inductance, and the application of (A.38) to a soft termination changes this to
a capacitance.
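The statement about the soft-terminated conical tube can be checked numerically. The following sketch approximates a cone by many short uniform segments, multiplies their lossless chain matrices, and searches for the first frequency at which $p = 0$ holds at both ends (vanishing upper-right matrix element). The air constants, the radii, the tube length of 17 cm and all function names are assumptions chosen for illustration; the staircase approximation is itself only approximate.

```python
import math

RHO0, C_SOUND = 1.2, 340.0  # assumed air density (kg/m^3) and sound speed (m/s)


def chain_b_element(areas, seg_len, freq_hz):
    """Imaginary part of the upper-right element of the lossless chain matrix."""
    theta = 2.0 * math.pi * freq_hz * seg_len / C_SOUND
    cs, sn = math.cos(theta), math.sin(theta)
    # Lossless products keep the form [[a, i*b], [i*c, d]] with real a, b, c, d.
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas:
        z = RHO0 * C_SOUND / area
        a, b, c, d = (a * cs - b * sn / z, a * z * sn + b * cs,
                      c * cs + d * sn / z, d * cs - c * z * sn)
    return b


def first_soft_resonance(areas, seg_len, lo=900.0, hi=1100.0):
    """Bisection for the zero of the B element (p = 0 at both ends) in [lo, hi] Hz."""
    f_lo = chain_b_element(areas, seg_len, lo)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        f_mid = chain_b_element(areas, seg_len, mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


LENGTH, N_SEG = 0.17, 400
# Cone: radius grows linearly from 5 mm to 15 mm (sampled at segment midpoints).
radii = [0.005 + 0.010 * (i + 0.5) / N_SEG for i in range(N_SEG)]
cone = [math.pi * r * r for r in radii]
uniform = [math.pi * 0.01 ** 2] * N_SEG
f_cone = first_soft_resonance(cone, LENGTH / N_SEG)
f_uniform = first_soft_resonance(uniform, LENGTH / N_SEG)
```

Both searches should end close to $c/2l = 1000$ Hz, the first resonance of a uniform tube that is soft at both ends.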



Area Perturbations. Given the solution for a specific eigenvalue (formant
frequency), how is this frequency affected by infinitesimal perturbations
$\delta A$ of the area function? Starting from (A.33), it can be shown that the
Webster operator $\mathcal{W} = \frac{d^2}{dx^2} + (A'/A)\frac{d}{dx}$ is
self-adjoint with respect to the scalar product
$\langle f|g\rangle = \int_0^l A f^* g\,dx$, and
$\langle p|\mathcal{W}p\rangle = \langle\mathcal{W}p|p\rangle = -(\omega/c)^2\langle p|p\rangle$.
Variation of this expression yields

$\delta(\omega/c)^2 = -\int_0^l A\,\delta(A'/A)\,p^* p'\,dx \Big/ \int_0^l A|p|^2\,dx$ ,

from which, by partial integration and substitution of (A.29) and noting
further that potential energy $\int W_p\,dx$ = kinetic energy $\int W_q\,dx$ =
$\frac12$ total energy $E$, we obtain

$\frac{\delta\omega}{\omega} = \frac{1}{E}\int_0^l (W_q - W_p)\,\frac{\delta A}{A}\,dx$ .    (A.39)

This is a well-known result obtained by Schroeder [A.2] and, for a discrete
LC line, by Fant [A.3]. The integral is $\delta E$, in accordance with
Ehrenfest’s theorem; it is also the work done against the radiation pressure in
the tube [A.2].

In the special case of a uniform tube with $q(0) = 0$ and $p(l) = 0$, the
solutions for the $n$th eigenvalue ($n = 1, 2, \ldots$) are
$q(x) = \sin((n - \frac12)\pi x/l)$, $p(x) = \cos((n - \frac12)\pi x/l)$, so
that $W_q - W_p = -\cos((2n - 1)\pi x/l)$. This is an antisymmetric function
about the tube center $x = l/2$. Consequently, if $\delta A/A$ is symmetric, it
cannot contribute to the perturbation. More specifically, because cosines of
different $n$ are orthogonal, only the odd spatial cosine-series component of
form $\cos((2n - 1)\pi x/l)$ will contribute to the perturbation of the $n$th
formant frequency. This is also a well-known result from [A.2].
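The symmetry argument can be illustrated by evaluating (A.39) numerically for the uniform tube. The sketch below assumes unit-amplitude solutions, so that $W_q - W_p = -\cos((2n-1)\pi x/l)$ and $E = l$ in these normalized units; the function name, the tube length and the perturbation shapes are hypothetical choices.

```python
import math


def relative_shift(n, delta_a_over_a, length=0.17, steps=2000):
    """delta_omega/omega of formant n for a uniform tube (hard glottis, soft lips).

    Midpoint-rule evaluation of (1/E) * integral (W_q - W_p) * dA/A dx with
    W_q - W_p = -cos((2n-1) pi x / l) and E = l (normalized units).
    """
    h = length / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        sensitivity = -math.cos((2 * n - 1) * math.pi * x / length)
        total += sensitivity * delta_a_over_a(x) * h
    return total / length


L_TRACT = 0.17
# Symmetric about x = l/2: should not shift any formant to first order.
symmetric = lambda x: 0.1 * math.cos(2.0 * math.pi * x / L_TRACT)
# Odd cosine component matching the first formant (wider at the glottis end).
odd_first = lambda x: 0.1 * math.cos(math.pi * x / L_TRACT)
```

With these choices, `relative_shift(1, symmetric)` vanishes up to rounding, whereas `relative_shift(1, odd_first)` gives a relative lowering of the first formant by half the perturbation amplitude, and the same perturbation leaves the second formant unchanged to first order.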



Length Perturbations. Consider monotonic perturbations of the $x$ axis,
$x \to x + \delta x(x)$, $\delta x(0) = 0$, $\delta x' > -1$. This can be
reduced to the previous case by setting $\delta A = -A'\delta x$. Partial
integration in (A.39) then yields, with $w = W_q + W_p$:

$\frac{\delta\omega}{\omega} = -\frac{1}{E}\int_0^l w\,\delta x'(x)\,dx$ .    (A.40)

Linear variation of the total length gives $\delta\omega/\omega = -\delta l/l$,
as it trivially should. The corresponding result for a discrete LC line is also
contained in [A.3].



A.2.4 Losses and Nonrigid Walls

Linear losses within the vocal tract have two main sources: viscosity and heat
conduction. (In constrictions, there may be additional, nonlinear losses due
to turbulence.) At speech-acoustic frequencies, both occur in a narrow layer
at the tube wall, whose thickness is proportional to $1/\sqrt{|\omega|}$. If
this is small compared to the tube radius, the viscous losses can be
represented by an impedance density in series with $i\omega L'$, given by
$(S/A^2)(i\omega\varrho_0\mu)^{1/2}$, where $S$ is the circumference of the tube
cross-section and $\mu$ is the viscosity. Since
$\sqrt{i\omega} = (1 + i\,\mathrm{sgn}\,\omega)\sqrt{|\omega|/2}$, this adds
both an inductive and a resistive term of equal sizes. For a derivation of this
and the following expressions, see [A.4]. Note that in complex square roots, we
always use the solution with positive real part.

Fig. A.2. Infinitesimal element of a transmission line with losses due to viscosity
and heat conduction; a wall admittance is also included



Likewise, the losses due to heat conduction add an admittance in parallel
to $i\omega C'$, given by
$S(\eta - 1)\varrho_0^{-1}c^{-2}(i\omega\lambda/c_p\varrho_0)^{1/2}$, where
$\eta$ is the adiabatic exponent, $\lambda$ the heat conductivity, and $c_p$ the
specific heat at constant pressure. Thus we have to substitute

$i\omega L' \to i\omega L' + R' = i\omega L'_0 + (1 + i\,\mathrm{sgn}\,\omega)R' , \qquad L' = L'_0 + R'/|\omega|$ ,

$i\omega C' \to i\omega C' + G' = i\omega C'_0 + (1 + i\,\mathrm{sgn}\,\omega)G' , \qquad C' = C'_0 + G'/|\omega|$ ,

where $L'_0$, $C'_0$ now denote the original, frequency-independent $L'$, $C'$
and $R'$, $G'$ are proportional to $\sqrt{|\omega|}$. This extends the circuit
of Fig. A.1 into that shown in Fig. A.2. The $L'$, $C'$, $R'$, $G'$ in this
figure do not actually correspond to such electronic components because of
their $\omega$ dependence; they just indicate imaginary and real parts of the
longitudinal impedance and transversal admittance. Moreover, these impedances
are of a very unpleasant form, because $\sqrt{i\omega}$ does not correspond to
any closed-form or finite-order differential or convolutional operator in the
time domain. This precludes any exact time-domain simulation of the lossy vocal
tract on the computer!

The characteristic impedance and phase velocity now also become complex
and frequency dependent. Instead of the phase velocity, we give the
propagation coefficient $\gamma = i\omega/c$. For small losses, i.e.
$R'/|\omega|L' \ll 1$, $G'/|\omega|C' \ll 1$, we have, with
$Z_0 = \sqrt{L'_0/C'_0}$,

$Z = \left((i\omega L' + R')/(i\omega C' + G')\right)^{1/2} \approx Z_0\left(1 + (1 + i\,\mathrm{sgn}\,\omega)(R'/L' - G'/C')/2i\omega\right)$ ,    (A.41)

$\gamma = \left((i\omega L' + R')(i\omega C' + G')\right)^{1/2} \approx i\omega\sqrt{L'_0 C'_0} + (1 + i\,\mathrm{sgn}\,\omega)(R'/Z_0 + G'Z_0)/2$ .    (A.42)

Thus, apart from adding a damping term $\mathrm{Re}\{\gamma\}$, $R'$ and $G'$
also slightly change the phase propagation, $\mathrm{Im}\{\gamma\}$. However,
the modification of $L'$, $C'$ by the $\sqrt{|\omega|}$ terms can often be
neglected, except at very low frequencies.
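How good the small-loss approximations (A.41), (A.42) are can be checked with a few lines of complex arithmetic. The sketch below compares them with the exact square roots; the function name and the numerical values of $L'_0$, $C'_0$, $R'$, $G'$ are purely illustrative assumptions, not data for a real vocal tract, and the correction term in (A.41) is evaluated with $L'_0$, $C'_0$, which is equivalent to first order.

```python
import cmath
import math


def exact_and_approx(omega, l0, c0, r, g):
    """Return (Z_exact, Z_approx, gamma_exact, gamma_approx) for given line densities."""
    sgn = 1.0 if omega >= 0 else -1.0
    l_eff = l0 + r / abs(omega)          # L' = L'_0 + R'/|omega|
    c_eff = c0 + g / abs(omega)          # C' = C'_0 + G'/|omega|
    series = 1j * omega * l_eff + r      # longitudinal impedance density
    shunt = 1j * omega * c_eff + g       # transversal admittance density
    # cmath.sqrt returns the root with non-negative real part.
    z_exact = cmath.sqrt(series / shunt)
    gamma_exact = cmath.sqrt(series * shunt)
    z0 = math.sqrt(l0 / c0)
    z_approx = z0 * (1 + (1 + 1j * sgn) * (r / l0 - g / c0) / (2j * omega))
    gamma_approx = (1j * omega * math.sqrt(l0 * c0)
                    + (1 + 1j * sgn) * (r / z0 + g * z0) / 2)
    return z_exact, z_approx, gamma_exact, gamma_approx


OMEGA = 2.0 * math.pi * 1000.0
L0, C0 = 3000.0, 2.88e-9                 # assumed densities for a 4 cm^2 tube
R, G = 0.01 * OMEGA * L0, 0.005 * OMEGA * C0   # assumed 1 % and 0.5 % losses
z_ex, z_ap, g_ex, g_ap = exact_and_approx(OMEGA, L0, C0, R, G)
```

For losses of this size the relative deviations `abs(z_ex - z_ap) / abs(z_ex)` and `abs(g_ex - g_ap) / abs(g_ex)` are of second order in the loss ratios.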

Nonrigid walls. It is also possible to approximate the effect of nonrigid walls.
We assume the walls to be locally reacting (i.e. there is no lateral coupling
along the tube) and of constant characteristics around the circumference of
one cross-section; moreover radiation through the walls is neglected - surely
daring assumptions! Then the wall impedance may be represented as a
mass-spring-damping combination, electrically expressed by an LRC series
resonance circuit. In Fig. A.2, this circuit must be inserted in parallel with
$C'$ and $G'$ as a shunt admittance density
$(i\omega L_w + R_w + 1/i\omega C_w)^{-1}$, thus changing $Z$ and $\gamma$
again.

The series-resonance frequency of this circuit is very low, e.g. 30 Hz, so that
the spring $C_w$ can often be neglected. The mass, however, shifts the formant
frequencies upwards; their damping also increases due to wall losses. For
example, when a constriction at the lips goes to zero, the first formant no
longer approaches zero as it would in the hard-walled tube but remains bounded
at, say, 170 Hz.

For certain distributions of the wall impedance the wave equation can 
be solved exactly [A. 5]. However, since the effect is large only for the first 
formant, the wall impedance may often be represented by one or a few lumped 
LRC or LR shunts instead of by the above distributed shunts. 



A.3 Discrete Modeling of a Tube

By “discrete modeling” we mean the spatial discretization of the area function
for the purposes of simulating the sound transmission in the time domain, e.g.
for articulatory synthesis, or of computing transfer functions and impedances
of the vocal tract in the frequency domain. The spatial segments may be of
unequal length. Historically, there were analog hardware implementations
by electronic components in the form of Figs. A.1 or A.2. But we will base
our considerations only on the wave equations themselves. Under simplifying
assumptions, these can be simulated efficiently on the computer in the time
domain. Then the time axis must also be discretized, thus turning the whole
system into a digital filter.

A.3.1 Time-Domain Modeling

As mentioned in Sect. A.2.4, no time-domain modeling of an actual lossy vocal
tract is possible. We will restrict ourselves to the simplified case of
nondispersive propagation and frequency-independent, real characteristic
impedance, i.e., $L'C' = \mathrm{const}$; $R'/L' = G'/C'$; $R'$, $G'$ frequency
independent. At first we even give the equations for the lossless case only.
Also the wall impedance will initially be neglected.

As the wave propagation is especially simple in uniform tubes, we will
consider uniform segments, also piecewise constant in time. Then in intervals
where $A(x,t) = \mathrm{const}$, there is free propagation of left and right
traveling waves along the characteristics $dx = \pm c\,dt$, whereas at the
interval boundaries, reflections occur. Note that the spatial segmentation is
at variance with our initial assumption that the slope of the tube walls is
small: here it is infinite at the segment boundaries! Thus the approximation is
physically good only for small relative area changes. (Other tube forms without
discontinuities, for instance, composed of exponential or conical segments,
cause numerical stability problems [cf. Sect. A.2.2], and exponential segments
moreover are no finite-order systems. But they can, of course, be used in
frequency-domain simulations.)

We assume a minimal length $d$ of which all spatial segments are integer
multiples. As both $x$ and $t$ are discretized, there will be a relation
between $d$ and the sampling period $T = 1/f_s$. Intuitively one would guess
$d = cT$, but it turns out that $d = cT/2$, $f_s = c/2d$ is sufficient, if the
sampling epochs for adjacent segments are displaced by $T/2$. This is clear
from Fig. A.3, where the spatiotemporal lattice of pulse propagation is shown.
(The temporal area changes should occur at intervals $T/2$ or multiples
thereof.) If instead $f_s = c/d$ were chosen, two causally nonconnected
pulse-trajectory lattices would be present. A sampling frequency of 10 kHz now
corresponds to a segment length of 17 mm. Unfortunately, the computational
expense increases with $d^{-2}$ or $f_s^2$.
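The relation between sampling frequency and segment length is easily tabulated. The following sketch assumes $c = 340$ m/s and a tract length of 17 cm; the function names are hypothetical.

```python
C_SOUND = 340.0  # assumed speed of sound in m/s


def segment_length(fs_hz, c=C_SOUND):
    """Minimal segment length d = c / (2 f_s) in metres."""
    return c / (2.0 * fs_hz)


def segments_for_tract(tract_length_m, fs_hz, c=C_SOUND):
    """Number of minimal-length segments needed to cover the tract."""
    return round(tract_length_m / segment_length(fs_hz, c))


def relative_cost(fs_hz, fs_ref_hz):
    """Computational expense relative to a reference rate (grows with f_s squared)."""
    return (fs_hz / fs_ref_hz) ** 2
```

For example, `segment_length(10000.0)` gives 0.017 m, so that a 17 cm tract needs 10 segments; doubling the sampling frequency doubles both the number of segments and the number of time steps.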




Fig. A.3. Spatiotemporal pulse trajectories in a segmented tube; temporal
sampling interval $2d/c$; dashed: regions of constant $Z = Z_{n,k}$



We use index $n$ for space and $k$ for time ($k$ will never mean wave number
here). However, to keep the indexing simple, we will index time at double
rate, as if $f_s = c/d$, but noting that in one of two adjacent segments, only
the odd $k$ occur and in the other one, only the even $k$. (Alternatively, we
could use an oblique space-time coordinate system with space axis parallel
to the right traveling waves, thus choosing different time zero points for each
segment.) The continuous fields are replaced by pulse sequences of amplitudes
$p_{n,k}$ etc. at the sampling frequency, traveling along the diagonals of the
$d \times d/c$ rectangles of constant $Z = Z_{n,k}$. For the $p$, $q$ samples,
$n$, $k$ refers to the lower left corner of this rectangle.

As vocal-tract motions are rather slow compared to acoustic frequencies,
we give here only the formulas for the time-invariant case, neglecting
reflections due to temporal area changes ($Q = 0$, $Z_{n,k} = Z_n$). Then it is
known that $p$, $q$ are continuous at the segment boundaries ($\psi$,
$\varphi$ are not!). The difference equations can best be derived from the
constancy of the waves $p_+$, $p_-$ (A.22) along the diagonals:

$p_{n+1,k+1} + Z_n q_{n+1,k+1} = p_{n,k} + Z_n q_{n,k}$ ,
$p_{n,k+1} - Z_n q_{n,k+1} = p_{n+1,k} - Z_n q_{n+1,k}$ .    (A.43)



These can be resolved for any two of the $(p, q)$ pairs, given all the others,
depending on the task, for instance, computing future from past values:

$p_{n,k+1} = \frac{Z_{n-1}^{-1}p_{n-1,k} + Z_n^{-1}p_{n+1,k} + q_{n-1,k} - q_{n+1,k}}{Z_{n-1}^{-1} + Z_n^{-1}}$ ,

$q_{n,k+1} = \frac{p_{n-1,k} - p_{n+1,k} + Z_{n-1}q_{n-1,k} + Z_n q_{n+1,k}}{Z_{n-1} + Z_n}$ .    (A.44)



However, it is computationally much more efficient to directly employ
the wave representation. (The wave subscripts $+$, $-$ will now be written as
superscripts because of the other subscripts $n$, $k$.) Here the outgoing waves
are obtained by the scattering of the incoming waves at a boundary:

$p^+_{n+1,k+1} = (1 + r_n)\,p^+_{n,k} - r_n\,p^-_{n+1,k}$ ,
$p^-_{n,k+1} = (1 - r_n)\,p^-_{n+1,k} + r_n\,p^+_{n,k}$ ;    (A.45)

$r_n = \frac{Z_{n+1} - Z_n}{Z_{n+1} + Z_n} = \frac{A_n - A_{n+1}}{A_n + A_{n+1}} , \qquad |r_n| < 1$ ,    (A.46)

which are the discrete form of (A.25) for $Q = 0$. These can be extended for
the time-varying tube, see [A.6], needing two different reflection factors; but
this is usually not required. In the $(p, q)$-representation, the time-varying
case is more difficult to treat, because $p$ and $q$ are not continuous in
space and time at the corners of the constant-$Z$ rectangles and thus cannot be
assigned unique values there!

For volume-velocity or square-root-of-power waves, the scattering equations
differ from (A.45) only in the form of the transmission factors, if the
same $r_n$ from (A.46) is used here too (although the actual reflection factor
for $q$ is $-r_n$):

for $q$ waves, $(1 \pm r_n) \to (1 \mp r_n)$ ,

for $\psi$ waves, $(1 \pm r_n) \to \sqrt{1 - r_n^2}$ .

The $p$ and $q$ waves have the advantage that they can be implemented with a
single multiplier (and three adders), but may result in an unpleasantly large
range of values when $r_n$ approaches $\pm 1$. The $\psi$ waves require four
multipliers, but their values, directly related to power, are less prone to
become large. In the terminology of wave digital filters (WDF) [A.7], the
scattering equations (A.45) describe a “two-port adaptor”.
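The single-multiplier form of the scattering junction can be sketched as follows, assuming the scattering equations for pressure waves in the form $p^+_{\rm out} = (1 + r)p^+_{\rm in} - r\,p^-_{\rm in}$, $p^-_{\rm out} = (1 - r)p^-_{\rm in} + r\,p^+_{\rm in}$ with $r = (A_n - A_{n+1})/(A_n + A_{n+1})$. The function names are hypothetical; the second function is included only to compare with the direct evaluation.

```python
def reflection_coefficients(areas):
    """Reflection factor r_n for each boundary between segments n and n+1."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]


def scatter(p_plus_in, p_minus_in, r):
    """Pressure-wave junction with one multiplication and three additions.

    p_plus_in arrives from the left segment, p_minus_in from the right one.
    Returns (p_plus_out, p_minus_out).
    """
    w = r * (p_plus_in - p_minus_in)
    return p_plus_in + w, p_minus_in + w


def scatter_direct(p_plus_in, p_minus_in, r):
    """Same junction written out with explicit transmission factors."""
    p_plus_out = (1.0 + r) * p_plus_in - r * p_minus_in
    p_minus_out = (1.0 - r) * p_minus_in + r * p_plus_in
    return p_plus_out, p_minus_out
```

Both forms give the same outgoing waves, and the total pressure $p^+ + p^-$ is the same on both sides of the boundary, as required by the continuity of $p$.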

If the simulation is to include the nasal tract, at the branching the
equations become more complicated, based on continuity of $p$ and conservation
of $q$. In the WDF terminology, this is described by a “three-port parallel
adaptor”, with $r_n$ replaced by two parameters.

Damping. In the wave representation, a frequency-independent damping 
as mentioned above can easily be introduced: Simply multiply each wave 
quantity by a factor smaller than 1 between two time steps. 

Termination and Wall Impedances. For the complete tract model, not
only the tube but also the termination impedances at lips, nostrils, glottis,
and possibly the wall impedances must be modeled. In most cases these
impedances will first be represented by analog one-port networks, whose
differential equations are then approximated by difference equations. These
define a relation between the $p$ and $q$ signals or between the two wave
signals at the connection. In the wave representation, it is natural to use the
WDF formulation of these networks and attach them using the appropriate
adaptors. For instance, if $a$ is the incident and $b$ the reflected wave, an
inductance $L$ is described as $b = -z^{-1}a$ or $b_k = -a_{k-2}$ with a port
impedance $2f_sL$, a capacitance $C$ as $b = z^{-1}a$ with a port impedance
$1/(2f_sC)$, a resistance $R$ as $b = 0$ with a port impedance $R$. But note
that, whereas the wave representation of the segmented tube is exact, the WDF
description of discrete components is equivalent to the bilinear $z$-transform,
distorting the frequency scale according to
$\omega \to 2f_s\arctan(\omega/2f_s)$. One has to choose a sufficiently
high sampling rate, also desirable for fine segmentation.
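The size of the frequency distortion caused by the bilinear transform can be estimated directly from the mapping $\omega \to 2f_s\arctan(\omega/2f_s)$. The function name and the example frequencies below are assumptions for illustration.

```python
import math


def warped_frequency(f_hz, fs_hz):
    """Frequency (Hz) at which an analog frequency f_hz appears after the
    bilinear transform with sampling frequency fs_hz."""
    omega = 2.0 * math.pi * f_hz
    return 2.0 * fs_hz * math.atan(omega / (2.0 * fs_hz)) / (2.0 * math.pi)
```

At $f_s = 10$ kHz, a resonance of a lumped termination designed for 1 kHz is shifted down by roughly 3 %, whereas at 10 Hz the shift is negligible; a higher sampling rate reduces the distortion.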

Sources in the Tract. A noise source in the tract for fricative excitation is
obtained by adding random numbers to the field quantities. Their standard
deviation must depend on area and flow in the segment in an appropriate
way. A plosion burst can be simulated by opening a closed tract after pressure
buildup. Both these sources require the inclusion of a dc flow or subglottal
pressure in the simulations. Also in narrow constrictions, the damping must
become large to avoid the occurrence of unnaturally high field values.

A.3.2 Frequency-Domain Modeling, Two-Port Theory

In the time-invariant frequency-domain description of the vocal-tract model,
we have full freedom with respect to dispersion and damping. The most
convenient approach is the representation of segments by the chain or
transmission matrices of the two-port theory, which are easily multiplied to
yield the corresponding matrix of the whole tube. From these and the
terminating impedances, the transfer functions and input impedances are
obtained. Let the segments be uniform again (e.g., conical or exponential
shapes would also be possible), but we now allow the lengths to be arbitrary
($d \to d_n$). If no direct time-domain modeling is possible, the transfer
functions computed in the frequency domain can be used to implement a digital
filter, possibly by Fourier techniques (overlap-add).



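(p,q)-Representation. The chain matrix of a segment connects the values
at the segment input with those at the output:

$\begin{pmatrix} p_n \\ q_n \end{pmatrix}_{\rm in} = \begin{pmatrix} \cosh(\gamma_n d_n) & Z_n\sinh(\gamma_n d_n) \\ Z_n^{-1}\sinh(\gamma_n d_n) & \cosh(\gamma_n d_n) \end{pmatrix} \begin{pmatrix} p_n \\ q_n \end{pmatrix}_{\rm out}$ .    (A.47)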

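Without damping, $Z_n = \varrho_0 c/A_n$, $\gamma_n = i\omega/c$,
$\cosh(\gamma_n d_n) = \cos(\omega d_n/c)$,
$\sinh(\gamma_n d_n) = i\sin(\omega d_n/c)$; with damping, see (A.41), (A.42).
Even with finite wall impedance, the form of the matrix remains the same.
Because $p$ and $q$ are continuous,
$(p_n, q_n)_{\rm out} = (p_{n+1}, q_{n+1})_{\rm in}$, so the matrices from
(A.47) can simply be multiplied to give the total chain matrix; let us call it
$\begin{pmatrix} \mathcal{A} & \mathcal{B} \\ \mathcal{C} & \mathcal{D} \end{pmatrix}$.
Obviously the determinant is $\mathcal{A}\mathcal{D} - \mathcal{B}\mathcal{C} = 1$,
so that the inverse matrix is
$\begin{pmatrix} \mathcal{D} & -\mathcal{B} \\ -\mathcal{C} & \mathcal{A} \end{pmatrix}$.
This means, reciprocity is fulfilled:

$\left(\frac{p_{\rm out}}{q_{\rm in}}\right)_{q_{\rm out}=0} = -\left(\frac{p_{\rm in}}{q_{\rm out}}\right)_{q_{\rm in}=0}$ ,    (A.48a)

$\left(\frac{q_{\rm out}}{p_{\rm in}}\right)_{p_{\rm out}=0} = -\left(\frac{q_{\rm in}}{p_{\rm out}}\right)_{p_{\rm in}=0}$ .    (A.48b)

In the $(\psi, \varphi)$ representation, the matrix in (A.47) contains no
$Z_n$; instead each of the segment boundaries also has a chain matrix
$\mathrm{diag}(\sqrt{Z_{n+1}/Z_n}, \sqrt{Z_n/Z_{n+1}})$ connecting
$(\psi_n, \varphi_n)_{\rm out}$ with $(\psi_{n+1}, \varphi_{n+1})_{\rm in}$.
In the lossless case, the segment matrices are unitary.

Let the tube be terminated by a radiation impedance $Z_{\rm rad}$. Then the
transfer function for volume velocity is

$H_q = q_{\rm out}/q_{\rm in} = 1/(\mathcal{C}Z_{\rm rad} + \mathcal{D})$ .    (A.49)

The far-field sound pressure in front of the mouth is proportional to
$i\omega q_{\rm out}$. The input impedance of the tract “seen” by the glottis is

$Z_{\rm in} = p_{\rm in}/q_{\rm in} = (\mathcal{A}Z_{\rm rad} + \mathcal{B})/(\mathcal{C}Z_{\rm rad} + \mathcal{D})$ .    (A.50)

In this manner, all quantities of interest can be calculated. When the tube is
branched, the parallel combination of the input impedances of two branches
is taken as the output impedance of the third branch.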
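The chain-matrix calculation can be sketched in a few lines. The code below multiplies lossless segment matrices of the form $\begin{pmatrix}\cosh\gamma d & Z\sinh\gamma d\\ Z^{-1}\sinh\gamma d & \cosh\gamma d\end{pmatrix}$ from glottis to lips and evaluates the volume-velocity transfer function and the input impedance for a given termination. The air constants, the example areas and lengths and all function names are assumptions for illustration; losses are not included.

```python
import cmath
import math

RHO0, C_SOUND = 1.2, 340.0  # assumed air density (kg/m^3) and sound speed (m/s)


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return ((a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
            (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]))


def total_chain_matrix(areas, lengths, freq_hz):
    """Multiply the lossless segment chain matrices from input to output."""
    gamma = 1j * 2.0 * math.pi * freq_hz / C_SOUND
    total = ((1.0, 0.0), (0.0, 1.0))
    for area, d in zip(areas, lengths):
        z = RHO0 * C_SOUND / area
        ch, sh = cmath.cosh(gamma * d), cmath.sinh(gamma * d)
        total = matmul2(total, ((ch, z * sh), (sh / z, ch)))
    return total


def volume_velocity_transfer(areas, lengths, freq_hz, z_rad=0.0):
    """H_q = q_out / q_in = 1 / (C * Z_rad + D)."""
    (_, _), (c_el, d_el) = total_chain_matrix(areas, lengths, freq_hz)
    return 1.0 / (c_el * z_rad + d_el)


def input_impedance(areas, lengths, freq_hz, z_rad=0.0):
    """Z_in = (A * Z_rad + B) / (C * Z_rad + D)."""
    (a_el, b_el), (c_el, d_el) = total_chain_matrix(areas, lengths, freq_hz)
    return (a_el * z_rad + b_el) / (c_el * z_rad + d_el)


# Uniform 17 cm tube split into three segments of unequal length.
AREAS = [4e-4, 4e-4, 4e-4]
LENGTHS = [0.05, 0.05, 0.07]
```

With a short-circuit termination ($Z_{\rm rad} = 0$), the transfer function of this uniform tube is $1/\cos(\omega l/c)$, so that its magnitude becomes very large near 500 Hz and 1500 Hz and equals 1 at 1000 Hz.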



Radiation Impedance. The radiation impedance of the actual mouth cannot
be written in closed form. The simplest approximation (beyond short-circuit
or inductance) using constant components will be a parallel combination
of inductance and conductance,

$Z_{\rm rad} = 1/\left((1/i\omega L_{\rm rad}) + G_{\rm rad}\right) , \qquad L_{\rm rad} \propto 1/\sqrt{A} , \qquad G_{\rm rad} \propto A$ ,    (A.51)

where $A$ is the lip-opening area. This would hold exactly for a pulsating
sphere, with $L_{\rm rad} = \varrho_0/\sqrt{4\pi A}$,
$G_{\rm rad} = A/\varrho_0 c$. The proportionality factors are often derived
from a piston in a plane baffle or a piston in a sphere of the size of the
head. But in the frequency domain, there is actually no need to consider
discrete-component circuits; arbitrary functions are possible.
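For orientation, the pulsating-sphere values can be evaluated numerically. The sketch below assumes $\varrho_0 = 1.2$ kg/m³ and $c = 340$ m/s and uses the expressions $L_{\rm rad} = \varrho_0/\sqrt{4\pi A}$, $G_{\rm rad} = A/\varrho_0 c$ in the parallel combination; the function names and the example area are hypothetical.

```python
import math

RHO0, C_SOUND = 1.2, 340.0  # assumed air density (kg/m^3) and sound speed (m/s)


def sphere_parameters(area):
    """Return (L_rad, G_rad) of the pulsating-sphere model for opening area A."""
    l_rad = RHO0 / math.sqrt(4.0 * math.pi * area)
    g_rad = area / (RHO0 * C_SOUND)
    return l_rad, g_rad


def radiation_impedance(freq_hz, area):
    """Parallel combination of inductance and conductance; freq_hz must be nonzero."""
    l_rad, g_rad = sphere_parameters(area)
    omega = 2.0 * math.pi * freq_hz
    return 1.0 / (1.0 / (1j * omega * l_rad) + g_rad)
```

At low frequencies the result is almost purely inductive, $Z_{\rm rad} \approx i\omega L_{\rm rad}$, with a small positive real part that represents the radiated power; at high frequencies it approaches $1/G_{\rm rad} = \varrho_0 c/A$.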

The following trick can facilitate the modeling of the radiation impedance 
[A. 8]. The spherical waves outside the mouth are approximately described in 
an extended tube model with quadratically increasing area. At its end, the 
radiation impedance for the larger area is less critical to model. This extended 
tube also allows the modeling of lip protrusion without changing the total 
number of segments. 



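Wave Representation. The chain matrices are here replaced by transmission
matrices:

$\begin{pmatrix} p^+_n \\ p^-_n \end{pmatrix}_{\rm in} = \begin{pmatrix} \exp(\gamma_n d_n) & 0 \\ 0 & \exp(-\gamma_n d_n) \end{pmatrix} \begin{pmatrix} p^+_n \\ p^-_n \end{pmatrix}_{\rm out}$    (A.52)

and likewise for the $q$ or $\psi$ waves. The segment boundaries also have
transmission matrices, representing reflection and transmission:

$\begin{pmatrix} p^+_n \\ p^-_n \end{pmatrix}_{\rm out} = \begin{pmatrix} 1/(1 + r_n) & r_n/(1 + r_n) \\ r_n/(1 + r_n) & 1/(1 + r_n) \end{pmatrix} \begin{pmatrix} p^+_{n+1} \\ p^-_{n+1} \end{pmatrix}_{\rm in}$ .    (A.53)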

For $q$ waves, $(1 + r_n)$ is replaced by $(1 - r_n)$, and for $\psi$ waves, by
$\sqrt{1 - r_n^2}$.

In order to obtain the total transmission matrix, these matrix pairs have to
be multiplied up again. With the $\psi$ waves and no losses, the transmission
matrix is $\in$ SU(1,1), preserving the power $|\psi^+|^2 - |\psi^-|^2$.

An impedance, e.g. $Z_{\rm rad}$, connected to the output of the last ($N$th)
segment is represented by a complex reflection factor
$r_{\rm rad} = (Z_{\rm rad} - Z_N)/(Z_{\rm rad} + Z_N)$, so that
$p^-_{N,\rm out} = r_{\rm rad}\,p^+_{N,\rm out}$; likewise for the $q$ and
$\psi$ waves. If the total transmission matrix is
$\begin{pmatrix} \mathcal{P} & \mathcal{Q} \\ \mathcal{R} & \mathcal{S} \end{pmatrix}$,
the reflection factor at the tube input and the input impedance are obtained:

$r_{\rm in} = p^-_{\rm in}/p^+_{\rm in} = (\mathcal{R} + \mathcal{S}r_{\rm rad})/(\mathcal{P} + \mathcal{Q}r_{\rm rad}) , \qquad Z_{\rm in} = Z_1\,\frac{1 + r_{\rm in}}{1 - r_{\rm in}}$ ,    (A.54)

where $Z_1$ is the characteristic impedance of the first segment. The
volume-velocity transfer function is, for $p$ waves,

$H_q = 2Z_1/\left((\mathcal{P} - \mathcal{R})(Z_{\rm rad} + Z_N) + (\mathcal{Q} - \mathcal{S})(Z_{\rm rad} - Z_N)\right)$ .    (A.55)

(For $q$ waves, replace $Z_1$ by $Z_N$ in this equation, and for $\psi$ waves,
by $\sqrt{Z_1 Z_N}$.)
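That the wave representation and the $(p, q)$ representation describe the same tube can be verified numerically. The sketch below computes the input impedance of a lossless two-segment tube once from pressure-wave transmission matrices (segment delays and boundary scattering) and once from chain matrices. The air constants, the areas, lengths and termination value, and all function names are assumptions for illustration.

```python
import cmath
import math

RHO0, C_SOUND = 1.2, 340.0  # assumed air density (kg/m^3) and sound speed (m/s)
IDENTITY = ((1.0, 0.0), (0.0, 1.0))


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return ((a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
            (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]))


def input_impedance_waves(areas, lengths, freq_hz, z_rad):
    """Z_in from the total transmission matrix of the pressure waves."""
    gamma = 1j * 2.0 * math.pi * freq_hz / C_SOUND
    z = [RHO0 * C_SOUND / a for a in areas]
    total = IDENTITY
    for n in range(len(areas)):
        e = cmath.exp(gamma * lengths[n])
        total = matmul2(total, ((e, 0.0), (0.0, 1.0 / e)))       # segment delay
        if n + 1 < len(areas):
            r = (z[n + 1] - z[n]) / (z[n + 1] + z[n])              # reflection factor
            t = 1.0 / (1.0 + r)
            total = matmul2(total, ((t, r * t), (r * t, t)))       # boundary
    (p_el, q_el), (r_el, s_el) = total
    r_rad = (z_rad - z[-1]) / (z_rad + z[-1])
    r_in = (r_el + s_el * r_rad) / (p_el + q_el * r_rad)
    return z[0] * (1.0 + r_in) / (1.0 - r_in)


def input_impedance_chain(areas, lengths, freq_hz, z_rad):
    """Z_in from the total chain matrix of the (p, q) representation."""
    gamma = 1j * 2.0 * math.pi * freq_hz / C_SOUND
    total = IDENTITY
    for area, d in zip(areas, lengths):
        zc = RHO0 * C_SOUND / area
        ch, sh = cmath.cosh(gamma * d), cmath.sinh(gamma * d)
        total = matmul2(total, ((ch, zc * sh), (sh / zc, ch)))
    (a_el, b_el), (c_el, d_el) = total
    return (a_el * z_rad + b_el) / (c_el * z_rad + d_el)


AREAS, LENGTHS = [1e-4, 4e-4], [0.08, 0.09]
Z_RAD = 1.0e5 + 2.0e5j   # hypothetical termination impedance
z_waves = input_impedance_waves(AREAS, LENGTHS, 700.0, Z_RAD)
z_chain = input_impedance_chain(AREAS, LENGTHS, 700.0, Z_RAD)
```

The two results agree up to rounding errors; the wave form needs only exponentials and the reflection factors, whereas the chain form needs the characteristic impedances explicitly.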



Time Discretization and z-Transform. From the frequency-domain
equations for the lossless (or at least dispersion-free) case and common
segment length $d_n = d$, we can immediately derive the $z$-transform
equivalents with respect to the sampling frequency $f_s = c/2d$. Simply set
$z = \exp(i\omega/f_s)$. Then the chain matrix and the transmission matrix of a
segment become, for the lossless case,

$\frac12\begin{pmatrix} z^{1/2} + z^{-1/2} & Z_n(z^{1/2} - z^{-1/2}) \\ Z_n^{-1}(z^{1/2} - z^{-1/2}) & z^{1/2} + z^{-1/2} \end{pmatrix}$ and $\begin{pmatrix} z^{1/2} & 0 \\ 0 & z^{-1/2} \end{pmatrix}$ ,    (A.56)

respectively. Thus the elements of the total matrices are polynomials in
$z^{-1}$ times a total advance factor $z^{N/2}$. The matrix multiplication
gives recursion formulas for their coefficients. It is further easily verified
that the total lossless transmission matrix shows the time-reversal symmetry

$\mathcal{P}(z) = \mathcal{S}(z^{-1}) , \qquad \mathcal{Q}(z) = \mathcal{R}(z^{-1})$ .    (A.57)

This means the polynomial coefficients of $\mathcal{P}$, $\mathcal{Q}$ are those
of $\mathcal{S}$, $\mathcal{R}$ in reverse sequence, which halves the
computational expense in computing the transmission matrix. If there is a
common constant wave damping factor $a < 1$ per segment, $z^{-1/2}$ needs
simply be replaced by $a z^{-1/2}$.

Whereas the $z$ representation of the tube itself is exact, the forms for
attached discrete impedances must be approximated by the usual methods.
For wave quantities, the wave-digital-filter approach is the most natural one,
as stated in Sect. A.3.1.

From the $z$-transform expressions, we can immediately go back to the
time domain by interpreting $z^{-1/2}$ as a delay by $d/c$, i.e.
$k \to k - 1$ using the double-rate time indexing of Sect. A.3.1, and get the
difference equations in space and time again.

A.3.3 Tube Models and Linear Prediction



It is well known from linear-prediction theory [A.9] that a discrete-time
all-pole filter $H(z) = 1/A(z)$, $A(z) = \sum_m a_m z^{-m}$, $a_0 = 1$, can be
realized, up to gain and delay factors, as the transfer function of a segmented
lossless tube as used in the previous sections. This can be seen from the
relations between “PARCOR” coefficients $k_m$ and the predictor coefficients
$a_m$, which lead to a realization of $A(z)$ as a nonrecursive lattice filter
and of $1/A(z)$ as a recursive one. Define auxiliary signals $e^+_m$, $e^-_m$
(called forward and backward prediction errors). The lattice filter is then
recursively defined as

$e^+_m = e^+_{m-1} - k_m z^{-1} e^-_{m-1}$ ,
$e^-_m = z^{-1} e^-_{m-1} - k_m e^+_{m-1}$ .    (A.58)

If $e^+_N$ is the input and $e^+_0$ the output ($e^-_N$ discarded), this
implements $1/A(z)$. Now compare this to the transmission in a tube. Combining
(A.53) and (A.56), this reads for the $p$ waves at the segment outputs:

$(1 + r_n)\,p^+_n = z^{1/2} p^+_{n+1} + r_n z^{-1/2} p^-_{n+1}$ ,
$(1 + r_n)\,p^-_n = z^{-1/2} p^-_{n+1} + r_n z^{1/2} p^+_{n+1}$ .    (A.59)
We now make the following identifications:

$m = N - n , \qquad k_m = r_n , \qquad e^\pm_m = \pm p^\pm_n \prod_{\nu=n}^{N-1}(1 + r_\nu)$ ,    (A.60)

and assume a short-circuit $p^-_N = -p^+_N$ at the tube output and a source of
real, constant impedance $Z_0$ at the input. This is equivalent to inserting a
segment 0 of impedance $Z_0$ before segment 1 whose $p^-_0$ is absorbed. The
$p^+_0$ is taken as the input signal, $p^+_N$ as the output signal. It is then
obvious that (A.58) and (A.59) become identical. [For $q$ or $\psi$ waves, the
factors $(1 + r_n)$ have to be replaced as mentioned after (A.53).] Note that
the factors $z^{\pm 1/2}$ in (A.59) mean that for the signals $e^+$, $e^-$, an
oblique space-time coordinate system (as briefly mentioned in Sect. A.3.1) has
been used, where right-traveling waves have a constant time coordinate. This
combines a pair of $z^{-1/2}$ delays in the right and left traveling waves into
ordinary $z^{-1}$ delays in the left wave only, as known from the lattice
filter.

But this way to identify (A.58) and (A.59), originally proposed by Wakita
[A.10], is not the only one. An earlier but less well-known interpretation using
different boundary conditions was given by Atal and Hanauer [A.11]. Here
we assume a “hard” input of the tube with a volume-velocity source
$q_{\rm in}$. The radiation impedance at the output is real and constant, so
that formally a segment $N + 1$ of this impedance with reflection factor
$r_N = r_{\rm rad}$ can be inserted, with $p^-_{N+1} = 0$ and
$p_{N+1} = p^+_{N+1}$ as the output signal of the total system. Instead of
deriving the formulas from the matrices, we can obtain the result from the
previous analogy, employing the reciprocity relations (A.48b) and duality
(A.11), as visualized by the electrical equivalent circuits in Fig. A.4.




Fig. A.4. Reciprocity and duality transformation between tube models related
to the PARCOR lattice. Top: Wakita’s boundary conditions; middle: reciprocity
applied and tube mirrored; bottom: duality applied, giving Atal’s boundary
conditions ($Z_{N+1} \propto Z_0^{-1}$). Source is always left, output right
To describe the Wakita system in the $(p, q)$ domain, we replace the formal
segment 0 by a pressure source $p_0 = 2p^+_0$ in series with a resistance $Z_0$
and assume a volume-velocity output
$q_N = (p^+_N - p^-_N)/Z_N = 2p^+_N/Z_N$ (Fig. A.4, top), just scaling the
transfer function by $Z_N^{-1}$. According to the reciprocity theorem, the
transfer function is then identical to that of a volume-velocity output in
(formal) segment 0 with respect to a pressure source in segment $N$ (Fig. A.4,
middle). Now we can invert the segment numbering, $n \to N + 1 - n$, so that
$r_n \to -r_{N-n}$. Next, we apply the duality transformation, changing the
zero-impedance pressure source into an infinite-impedance current source and
the volume-velocity output into a pressure output (Fig. A.4, bottom);
furthermore, this transformation inverts all areas, whereby the $r_n$ change
sign again and are identical to the original ones, apart from numbering. The
total area transformation is consequently $A_n \to 1/A_{N+1-n}$,
$n = 1, \ldots, N$.

Thus the PARCOR lattice filter can be identified with an acoustic tube in 
two different ways, depending on the physical boundary conditions assumed. 



A.4 Notes on the Inverse Problem

By inverse problem we mean, generally speaking, the determination of geo- 
metric properties of the vocal tract from acoustic data. The former can be de- 
tailed area values, but also the values of articulatory parameters. The acoustic 
data may comprise formant frequencies and bandwidths, spectra, autocorrelation
or LPC coefficients, or impedance data. We can only briefly present
some selected approaches. Area values are directly related to the transfer 
properties of the tract and therefore people have tried to determine them 
on theoretical grounds, whereas articulatory data require empirical mapping 
methods. 

A.4.1 Analytic and Numerical Methods

It is easily imaginable that any knowledge related to a finite frequency range 
will not give information about arbitrarily fine structure of the area function. 
But even if we knew all the (infinitely many) formant frequencies of the 
tube, this would only provide half the information required. This is easily 
seen from the area-perturbation results for a lossless tube [A. 2] as described 
in Sect. A. 2. 3. Only the odd cosine-series components of the logarithmic area 
function can be determined from the formant frequencies for small deviations 
from a uniform tube. Consequently, two sets of data are needed. These may 
be 

(1) formant frequencies for different boundary conditions - hardly available 
in the vocal tract; 

(2) poles and zeros of an input impedance, for instance, measured at the
lips from outside [A. 2], although this impedes articulation and forbids
phonation;

(3) formant frequencies and bandwidths for a lossy termination of known
form - difficult to measure exactly enough; 

(4) a piece of the autocorrelation function of the impulse response of the 
tract with lossy termination (or the corresponding LPC coefficients), ap- 
proximated by analysis of actual speech with proper preemphasis. 

In fact, these possibilities are essentially equivalent. For (3) and (4), this 
is clear, both expressing the transfer function of the tube with lossy ter- 
mination. As for (1) and (2), the poles and zeros of the input impedance 
correspond to the resonances for hard and soft termination. Furthermore, 
the input impedance is related to the transfer or autocorrelation function, as 
will be shown below. 

Let us start with approach (2), the construction of the area function from 
the input impedance. This is feasible in the frequency domain, but requires 
knowledge about length and termination at the other end. By identifying the 
poles and zeros with the formants for hard and soft termination, perturbation 
methods as in [A. 2] are applicable. The missing higher formants can be filled 
in using the asymptotic eigenvalue theorems for Sturm-Liouville systems and 
choosing the unknown length so that log A{x) deviates minimally from a 
uniform tube [A. 12].

However, the reconstruction from the input impedance is much more
comprehensive in the time domain. We consider the lossless tube only. At time
t = 0, apply a volume-velocity delta pulse to the tube input and record the
pressure response. This is the inverse Fourier transform (or z-transform,
respectively) $\zeta_{\mathrm{in}}(t)$ of the impedance $Z_{\mathrm{in}}(\omega)$
or $Z_{\mathrm{in}}(z)$. For reasons of causality, the interval $[0, \tau]$ of
this impulse response contains information about the length interval
$[0, c\tau/2]$; in discrete time, the samples $0, \ldots, n$ contain
information about the first $n + 1$ segments. In the continuous case, the
solution uses integral equations [A.13], whereas in the discrete case, the
segment areas are obtained by a simple recursion derived from (A.43). Even the
length of the tube need not be known. It might be estimated from the derived
tract shape itself.

Instead of using the input impulse response directly, we can (for the dis- 
crete lossless tube) employ the linear-predictive tube model from Sect. A. 3. 3 
to compute the area function. Since a current source was assumed, Atal’s 
boundary conditions apply. For a tube of at least M segments to be determined,
formally assume termination of the Mth segment by an arbitrary real constant
conductance $G_{\mathrm{trm}}$, where the pressure is
$p_{\mathrm{trm}} = p_{M\,\mathrm{out}}$. The equality of input and output
powers reads, in the frequency domain,

$$\mathrm{Re}(Z_{\mathrm{in}})\,|q_{\mathrm{in}}|^2 = G_{\mathrm{trm}}\,|p_{\mathrm{trm}}|^2 . \tag{A.61}$$

Note that for a unit pulse, $|q_{\mathrm{in}}|^2 = 1$. Transforming back to the
time domain, this yields

$$\zeta_{\mathrm{in}}(t) + \zeta_{\mathrm{in}}(-t) = 2\,G_{\mathrm{trm}}\,R_{p_{\mathrm{trm}}}(t) , \tag{A.62}$$

where $R_{p_{\mathrm{trm}}}$ is the autocorrelation function of
$p_{\mathrm{trm}}$. Thus, in discrete time, the samples
$2\zeta_{\mathrm{in}}(0), \zeta_{\mathrm{in}}(1), \ldots, \zeta_{\mathrm{in}}(M-1)$
(independent of the actual termination of segment M) can be identified with a
piece of a formal ACF, from which then $M - 1$ reflection coefficients are
obtained by the usual LPC techniques. This yields $Z_n$ or $A_n$,
$n = 1, \ldots, M$, up to an unknown factor, which may be determined from
$\zeta_{\mathrm{in}}(t = 0) = Z_1$.

Now let us consider approach (4), the well-known construction of a segmented
tube model by linear prediction, already described in Sect. A.3.3. We can start
with an autocorrelation function (ACF) or a covariance matrix of the speech
output, from which the PARCOR coefficients are derived [A.9] and then
interpreted as reflection factors between the segments of a tube. However, the
sequence of these reflection factors depends on the assumed boundary
conditions, as described above! Only if the logarithmic area function is
antisymmetric about its midpoint do the two cases considered yield the same
result. In fact, Atal’s boundary conditions of a hard glottis and constant real
termination ($Z_{N+1}$ in Fig. A.4, bottom) can be generalized to a more
realistic radiation admittance with constant real part and arbitrary imaginary
part without changing the result. Look at (A.62) with $M = N + 1$,
$G_{\mathrm{trm}} = 1/Z_{N+1}$. Equation (A.62) still holds if
$G_{\mathrm{trm}}$ is the constant real part of some complex admittance
$Y_{\mathrm{trm}}$. The conductance-inductance parallel model of the radiation
admittance mentioned in Sect. A.3.2 is of just this form! As the termination
does not influence the initial portion of the impulse response
$\zeta_{\mathrm{in}}(t)$ at the glottal end and (A.62) is independent of the
imaginary part of $Y_{\mathrm{trm}}$, the latter does not affect the initial
portion of the ACF of the output pressure and thus the PARCOR coefficients.
These can still be interpreted as the same reflection coefficients. But the
absolute area scaling remains unknown, because $G_{\mathrm{trm}}$ of the
radiation load is itself approximately proportional to the lip area. The
relations between input impulse response and output ACF were first presented
by Atal [A.14].
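
As a rough illustration of approach (4), the following Python sketch derives
reflection (PARCOR) coefficients from a piece of an ACF with the standard
Levinson-Durbin recursion and turns them into relative segment areas. The
function names and the area-ratio convention $A_{n+1}/A_n = (1 - k_n)/(1 + k_n)$
are assumptions made for this example; as stressed above, the sign and ordering
of the reflection factors depend on the boundary conditions chosen, and the
absolute area scale stays undetermined.

```python
def levinson_durbin(acf, order):
    """Return (predictor [1, a_1..a_order], reflection coefficients k_1..k_order).

    acf holds the autocorrelation values R(0), R(1), ..., R(order).
    """
    a = [1.0]
    err = acf[0]                      # prediction error energy
    ks = []
    for i in range(1, order + 1):
        # Correlation of the current predictor with the next ACF lag.
        acc = sum(a[j] * acf[i - j] for j in range(i))
        k = -acc / err
        ks.append(k)
        padded = a + [0.0]
        a = [padded[j] + k * padded[i - j] for j in range(i + 1)]
        err *= 1.0 - k * k
    return a, ks


def relative_areas(ks, first_area=1.0):
    """Relative areas, ASSUMING A_{n+1} / A_n = (1 - k_n) / (1 + k_n).

    This sign convention is an assumption of the sketch, not taken from the text.
    """
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


if __name__ == "__main__":
    acf = [1.0, 0.5, 0.25, 0.125]     # ACF of a first-order all-pole model
    a, ks = levinson_durbin(acf, 3)
    print(a, ks, relative_areas(ks))
```

For this first-order ACF only the first reflection coefficient is nonzero, so
only the first area step differs from a uniform tube.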

As nice as these relations look, they have not led to a reliable determina- 
tion of the vocal-tract shape from the speech output. We have mentioned the 
extreme sensitivity of the results to the physical boundary conditions. Fur- 
ther, the actual transfer function of the vocal tract is not measurable, since its 
excitation is not a sequence of delta pulses. The glottal pulse shape or its
spectrum shows considerable variation. Moreover, the source-filter
approximation is not exact; the oscillating glottis acts as a time-varying
source impedance. Attempts to use adaptive preemphasis or LPC analysis on
closed-glottis intervals (which are already difficult to determine) may yield
reasonable-looking
results. Another unknown factor is the distribution of losses and wall admit- 
tance, not accounted for in the above approaches. For specific simplifying 
assumptions a solution based on integral equations has been achieved [A. 5]. 







A.4.2 Empirical Methods

These difficulties have led researchers to deliberately renounce reliance on 
complete acoustic data or analytic methods. Even in analytic approaches, 
incomplete information is often replaced by ad hoc assumptions; for instance, 
assume the logarithmic area function to be odd [A. 2], assume minimum least- 
squares deviation from a uniform tube [A. 12], guess the formant bandwidths 
according to empirical relations, etc. 

Instead of trying to reconstruct the area function directly, one can de- 
crease the number of unknowns by expressing the area function by some 
parameters. The cosine series of [A. 2] is a simple example; but more promis- 
ing appears to be the use of parameters from an articulatory model. This 
might automatically restrict the possible area functions to anatomically rea- 
sonable ones. However, such parameters cannot be determined by closed-form 
algorithms any more. Instead, empirical tables, vector quantizers, or neural 
networks must be employed to “learn” the relation between acoustic and ar- 
ticulatory data for the model by presenting many (synthetic) examples. An 
early large investigation, also considering nonuniqueness, was given in [A. 15]. 
Accounting for temporal continuity constraints of the parameter trajectories 
might be helpful. Since such methods are not directly related to the acoustic 
theory of the vocal tract, they are beyond the scope of this appendix. 




B. Direct Relations Between Cepstrum and Predictor Coefficients¹



Here we derive direct, i.e., nonrecursive, relations for the cepstrum in terms 
of the predictor coefficients and vice versa. Connections with algebraic roots, 
symmetric functions, statistical moments, and cumulants are pointed out. 
Some implications for pitch detection are also discussed. 

Recursive relations between cepstrum and predictor coefficients [B.1] have
long been known [B.2]. For some purposes, knowledge of direct relations be- 
tween these two sets of important parameters characterizing sources and sig- 
nals is desirable. 



B.1 Derivation of the Main Result



Let

$$A(z) := \sum_{k=0}^{p} a_k z^{-k} ; \qquad a_0 = 1 ; \quad a_p \neq 0 \tag{B.1}$$

be an “inverse filter” polynomial [B.2] of order p whose roots are inside the
unit circle. The $a_k$ are the predictor coefficients. Then $1/A(z)$ is a
(stable) all-pole filter whose cepstrum coefficients $c_n$ are customarily
defined by

$$\ln[1/A(z)] =: \sum_{n=1}^{\infty} c_n z^{-n} . \tag{B.2}$$

The well-known recursion relation between the $a_k$ and $c_n$ is obtained by
differentiating (B.2) with respect to $z^{-1}$ and equating equal powers of
$z^{-1}$, yielding [B.3]

$$c_n = -a_n - \frac{1}{n} \sum_{k=1}^{n-1} k\, c_k\, a_{n-k} . \tag{B.3}$$
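
The recursion (B.3) translates almost literally into code. The following
Python sketch is only an illustration; the function name and the list layout
`a = [1, a_1, ..., a_p]` are choices made here, and coefficients $a_n$ with
$n > p$ are taken as zero.

```python
def cepstrum_from_predictor(a, n_max):
    """Cepstrum c_1..c_{n_max} of 1/A(z) via the recursion (B.3).

    a = [1, a_1, ..., a_p] holds the predictor coefficients (a_0 = 1).
    """
    p = len(a) - 1
    c = [0.0] * (n_max + 1)           # c[0] is unused
    for n in range(1, n_max + 1):
        a_n = a[n] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:            # a_{n-k} vanishes beyond order p
                acc += k * c[k] * a[n - k]
        c[n] = -a_n - acc / n
    return c[1:]


if __name__ == "__main__":
    # A(z) = 1 - 0.5 z^-1 has the cepstrum c_n = 0.5**n / n.
    print(cepstrum_from_predictor([1.0, -0.5], 4))
```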



¹ Adapted from IEEE Trans. Acoustics, Speech, and Signal Processing ASSP-29,
297-301 (1981).

A direct (nonrecursive) relation can be obtained by applying a formula [B.4]
for the division of two power series to the ratio $-A'(z)/A(z)$ obtained after
differentiating the left side of (B.2). This gives

$$c_n = \frac{(-1)^n}{n}
\begin{vmatrix}
a_1 & 1 & 0 & \cdots & 0 \\
2a_2 & a_1 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(n-1)a_{n-1} & a_{n-2} & a_{n-3} & \cdots & 1 \\
n a_n & a_{n-1} & a_{n-2} & \cdots & a_1
\end{vmatrix} . \tag{B.4}$$


Unfortunately, this determinant is somewhat unwieldy. An alternative direct
form is, therefore, desirable and can be derived as follows. From (B.1) and
(B.2) we have

$$\ln\left(1 + \sum_{k=1}^{p} a_k z^{-k}\right) = -\sum_{n=1}^{\infty} c_n z^{-n} . \tag{B.5}$$

Using the well-known power series expansion for $\ln(1 + x)$ yields

$$\sum_{m=1}^{\infty} \frac{(-1)^m}{m} \left(\sum_{k=1}^{p} a_k z^{-k}\right)^{m} = \sum_{n=1}^{\infty} c_n z^{-n} \tag{B.6}$$

or [B.5]

$$\sum_{m=1}^{\infty} \frac{(-1)^m}{m} \sum_{n=m}^{\infty} z^{-n} \sum \frac{m!}{k_1! \cdots k_p!}\, a_1^{k_1} \cdots a_p^{k_p} = \sum_{n=1}^{\infty} c_n z^{-n} , \tag{B.7}$$

where the third sum has to be taken over all

$$k_1 + 2k_2 + \ldots + p k_p = n \tag{B.7a}$$

and

$$k_1 + k_2 + \ldots + k_p = m . \tag{B.7b}$$

Because m is summed over all positive integers, the condition (B.7b) can
be dropped if m in (B.7) is replaced by $k_1 + k_2 + \ldots + k_p$.

Equating equal powers of $z^{-1}$ in (B.7) then yields the desired direct
relation between cepstrum and predictor coefficients

$$c_n = \sum \frac{(k_1 + k_2 + \ldots + k_p - 1)!}{k_1! \cdots k_p!}\, (-a_1)^{k_1} \cdots (-a_p)^{k_p} , \tag{B.8}$$

where the sum is to be taken over all $k_r$ that fulfill (B.7a).

What does the restriction (B.7a) on the sum in (B.8) mean? Assume that
$n = 4$ and $p \ge 4$. Then (B.7a) can be satisfied by the following five
choices of the $k_i$:

    k_1  k_2  k_3  k_4
     4    0    0    0
     2    1    0    0
     1    0    1    0
     0    2    0    0
     0    0    0    1

In addition, all $k_i$ with $i > 4$ must equal zero.

Since $k_i$ is multiplied by i in (B.7a), we can also say that the different
$k_i$ are “counted” i times in adding up to n. In other words, each row of
the above table corresponds precisely to one decomposition of n into positive
integers:

    number of  1's  2's  3's  4's
                4    0    0    0    (1 + 1 + 1 + 1 = 4)
                2    1    0    0    (1 + 1 + 2 = 4)
                1    0    1    0    (1 + 3 = 4)
                0    2    0    0    (2 + 2 = 4)
                0    0    0    1    (4 = 4)                (B.8a)

Thus, the number of terms in (B.8) equals the number of partitions $P(n)$ of
n into positive integers not exceeding p. The generating function for $P(n)$
is

$$\prod_{k=1}^{p} \frac{1}{1 - x^k} = 1 + \sum_{n=1}^{\infty} P(n)\, x^n , \tag{B.8b}$$

a result that can be verified by expanding each term of the product into a
geometric series.

The restricted partitions $P(n)$ are related to the unrestricted partitions
$p(n)$ [B.6] by the formula

$$P(n) = p(n) - \sum_{i=0}^{n-p-1} p(i) , \tag{B.8c}$$

where $p(0)$ is defined to equal 1 and the empty sum is considered to be zero,
i.e., for $n \le p$, $P(n) = p(n)$. Equation (B.8c) is proved by observing that
for $n = p + 1$, $P(n) = p(n) - 1$ and by complete induction.
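
To make the bookkeeping behind (B.8) concrete, here is a small Python sketch
that enumerates all exponent vectors $(k_1, \ldots, k_p)$ satisfying (B.7a)
and sums the corresponding terms. The function names are chosen for this
example only, and the enumeration is a straightforward recursive search, not
an optimized partition generator.

```python
from math import factorial


def exponent_vectors(n, p):
    """Yield all (k_1, ..., k_p) of non-negative integers with sum(i * k_i) = n."""
    def rec(i, remaining):
        if i > p:
            if remaining == 0:
                yield ()
            return
        for k in range(remaining // i + 1):
            for rest in rec(i + 1, remaining - i * k):
                yield (k,) + rest
    yield from rec(1, n)


def cepstrum_direct(a, n):
    """c_n from a = [1, a_1, ..., a_p] by the direct formula (B.8), n >= 1."""
    p = len(a) - 1
    total = 0.0
    for ks in exponent_vectors(n, p):
        term = float(factorial(sum(ks) - 1))
        for i, k in enumerate(ks, start=1):
            term *= (-a[i]) ** k / factorial(k)
        total += term
    return total


if __name__ == "__main__":
    for ks in exponent_vectors(4, 4):     # the five rows of the table above
        print(ks)
    print(cepstrum_direct([1.0, 0.3, 0.2], 2))
```

The number of vectors produced by `exponent_vectors(n, p)` is the restricted
partition count $P(n)$, which is also the number of terms in (B.8).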



B.2 Direct Computation of Predictor Coefficients 
from the Cepstrum 

From (B.1) and (B.2) we have

$$\sum_{n=0}^{p} a_n z^{-n} = \exp\left(-\sum_{k=1}^{\infty} c_k z^{-k}\right) . \tag{B.9}$$

Expanding the exponential function into a power series results in

$$\sum_{n=0}^{p} a_n z^{-n} = \sum_{m=0}^{\infty} \frac{1}{m!} \left(-\sum_{k=1}^{\infty} c_k z^{-k}\right)^{m} . \tag{B.10}$$

Evaluation of the mth power [B.5] yields

$$\sum_{n=0}^{p} a_n z^{-n} = \sum_{m=0}^{\infty} \sum_{n=m}^{\infty} z^{-n} \sum \frac{(-c_1)^{k_1} \cdots (-c_n)^{k_n}}{k_1! \cdots k_n!} , \tag{B.11}$$

where the third sum on the right is to be taken over

$$k_1 + 2k_2 + \ldots + n k_n = n \tag{B.11a}$$

and

$$k_1 + k_2 + \ldots + k_n = m . \tag{B.11b}$$

Because of the sum over m in (B.11), the subsidiary condition (B.11b) is
obviated. Equating equal powers of $z^{-1}$ gives the desired direct relation
for the predictor coefficients in terms of the cepstrum

$$a_n = \sum \frac{(-c_1)^{k_1} \cdots (-c_n)^{k_n}}{k_1! \cdots k_n!} , \tag{B.12}$$

where the sum is to be taken over all $k_r$ subject to (B.11a).



B.3 A Simple Check 



With $c_k = -1/k$, the third sum in (B.11), summed according to (B.11a) and
(B.11b), equals [B.7] $1/n!$ times the number of permutations of n objects
which have exactly m cycles [B.8]. Thus, the sum over m in (B.11) must equal
1 for all n. Hence,

$$a_n = 1 \quad (n = 1, \ldots, p) \quad \text{if} \quad c_k = -\frac{1}{k} \quad (k = 1, \ldots, p) . \tag{B.13}$$

Equation (B.13) and its (generalized) inverse

$$c_n = -\frac{q^n}{n} \quad (n = 1, \ldots, p) \quad \text{if} \quad a_k = q^k \quad (k = 1, \ldots, p) \tag{B.14}$$

also follow directly from applying the summation formula for geometric series
to $A(z)$,

$$A(z) = 1 + q z^{-1} + \ldots + q^p z^{-p} = \frac{1 - q^{p+1} z^{-(p+1)}}{1 - q z^{-1}} ,$$

and expanding $\ln A(z)$ into a power series.
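
The check of this section can also be run numerically. The Python sketch below
evaluates the direct relation (B.12) by enumerating the exponent vectors of
(B.11a) and confirms (B.13) and (B.14) for small n. The function names and the
value $q = 0.7$ are arbitrary choices made for illustration.

```python
from math import factorial


def exponent_vectors(n):
    """Yield all (k_1, ..., k_n) of non-negative integers with sum(i * k_i) = n."""
    def rec(i, remaining):
        if i > n:
            if remaining == 0:
                yield ()
            return
        for k in range(remaining // i + 1):
            for rest in rec(i + 1, remaining - i * k):
                yield (k,) + rest
    yield from rec(1, n)


def predictor_from_cepstrum(c, n):
    """a_n from the cepstrum c = [c_1, c_2, ...] by the direct formula (B.12)."""
    if n == 0:
        return 1.0
    total = 0.0
    for ks in exponent_vectors(n):
        term = 1.0
        for i, k in enumerate(ks, start=1):
            term *= (-c[i - 1]) ** k / factorial(k)
        total += term
    return total


if __name__ == "__main__":
    c = [-1.0 / k for k in range(1, 7)]           # condition of (B.13)
    print([predictor_from_cepstrum(c, n) for n in range(1, 7)])
```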



B.4 Connection with Algebraic Roots 
and Symmetric Functions 

Equation (B.8) can also be derived as follows. If $z_r$ are the roots of
$A(z)$, then

$$A(z) = \prod_{r=1}^{p} (1 - z_r z^{-1}) \tag{B.15}$$

and

$$\ln[1/A(z)] = \sum_{r=1}^{p} \sum_{m=1}^{\infty} \frac{1}{m}\, z_r^m z^{-m} . \tag{B.16}$$

By inverting the order of summation one has (for uniform convergence of the
sum over m, i.e., for $|z_r z^{-1}| < 1$)

$$\ln[1/A(z)] = \sum_{m=1}^{\infty} \frac{R_m}{m}\, z^{-m} , \tag{B.17}$$

where the

$$R_m = \sum_{r=1}^{p} z_r^m \tag{B.17a}$$

are the “root-power sums” of $A(z)$.

By equating equal powers of $z^{-1}$ in (B.2) and (B.17) one obtains a relation
between cepstrum and root-power sums,

$$c_n = \frac{1}{n} R_n . \tag{B.18}$$

Equation (B.8) then results from Waring’s formula [B.9] for the root-power
sums $R_m$ in terms of the polynomial coefficients $a_k$.

It is interesting to note that the recursive relation (B.3), when the cepstrum
coefficients $c_n$ are replaced by the root-power sums $R_n$ using the
identity (B.18), was already known to Newton [B.10].
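
The identity (B.18) is easy to confirm numerically. The Python sketch below
builds $A(z)$ from a few chosen roots inside the unit circle, computes the
cepstrum by the recursion (B.3), and compares it with the root-power sums.
The root values and function names are arbitrary and serve only as an example.

```python
import cmath


def poly_from_roots(roots):
    """Coefficients [1, a_1, ..., a_p] of A(z) = prod_r (1 - z_r z^-1)."""
    a = [1.0 + 0j]
    for zr in roots:
        nxt = a + [0j]
        for k in range(1, len(nxt)):
            nxt[k] -= zr * a[k - 1]   # multiply by (1 - z_r z^-1)
        a = nxt
    return a


def cepstrum_recursive(a, n_max):
    """c_1..c_{n_max} from a = [1, a_1, ..., a_p] via the recursion (B.3)."""
    p = len(a) - 1
    c = [0j] * (n_max + 1)
    for n in range(1, n_max + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n) if n - k <= p)
        c[n] = -(a[n] if n <= p else 0j) - acc / n
    return c[1:]


def root_power_sum(roots, m):
    """R_m = sum over all roots of z_r**m, as in (B.17a)."""
    return sum(z ** m for z in roots)


if __name__ == "__main__":
    z1 = cmath.rect(0.9, 0.5)
    roots = [z1, z1.conjugate(), 0.5 + 0j]
    c = cepstrum_recursive(poly_from_roots(roots), 6)
    for n in range(1, 7):
        print(n, c[n - 1], root_power_sum(roots, n) / n)
```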

According to Vieta’s root theorem (see p. 102 of [B.8]), the relation between
the predictor coefficients $a_k$ and the roots $z_r$ is as follows,

$$\begin{aligned}
z_1 + z_2 + \cdots + z_p &= -a_1 \\
z_1 z_2 + z_1 z_3 + \cdots + z_{p-1} z_p &= a_2 \\
&\;\;\vdots \\
z_1 z_2 \cdots z_p &= (-1)^p a_p .
\end{aligned} \tag{B.19}$$

Here the left-hand sides are the complete set of elementary symmetric
functions of the roots $z_r$ and, as (B.19) shows, they are equal, to within a
factor of ±1, to the predictor coefficients $a_k$. (Symmetric functions are
defined as functions that do not change when the variables are arbitrarily
interchanged; see p. 138 of [B.8].)

On the other hand, the root-power sums

$$z_1^m + z_2^m + \ldots + z_p^m = R_m \tag{B.19a}$$

are also symmetric functions of the roots $z_r$, albeit not of the elementary
type. All symmetric functions can be expressed in terms of the elementary
functions (p. 138 of [B.8]) and, in fact, we find our result (B.8) in the
literature on symmetric functions [B.11].



B.5 Connection with Statistical Moments
and Cumulants

Let $f(x)$ be a probability density function. Its characteristic function
$F(y)$ is then defined as

$$F(y) := \int f(x)\, e^{ixy}\, dx , \tag{B.20}$$

or, by expanding the exponential,

$$F(y) = 1 + \sum_{k=1}^{\infty} \mu'_k \frac{(iy)^k}{k!} , \tag{B.21}$$

where the $\mu'_k$ are the “moments” of $f(x)$. The “cumulants” $\kappa_n$ are
then defined as the coefficients in the power series expansion of $\ln F(y)$ as
follows:

$$\ln F(y) = \sum_{k=1}^{\infty} \kappa_k \frac{(iy)^k}{k!} . \tag{B.22}$$

If we identify $iy$ with $z^{-1}$ and $F(y)$ in (B.21) with $A(z)$ in (B.1),
then we see that the predictor coefficients $a_n$ correspond to $\mu'_n/n!$ and
the cepstrum coefficients $c_n$ to $-\kappa_n/n!$. Our main result (B.8) is
then deduced by invoking the relation between statistical moments and
cumulants [B.12].



B.6 Computational Complexity 

The number of terms needed to compute $c_n$ directly from the $a_k$ equals
$P(n)$, the number of restricted partitions of n (see above). This number may
be further reduced by the following observation. It is known in probability
theory that the number of terms needed to represent the cumulant $\kappa_n$ by
central moments $\mu_k$ (moments about the mean) is much smaller than the
number of terms required by ordinary moments $\mu'_k$. In fact, all terms
containing $\mu_1$ disappear (because $\mu_1 = 0$).

How can we translate this saving to the direct computation of the cepstrum
coefficients $c_n$ from the predictor coefficients $a_k$? We need a
transformation that makes $a_1$ (which corresponds to $\mu'_1$) equal to zero.
In probability theory, the required operation is a shifting of the distribution
function $f(x)$ by $-\mu'_1$ or, equivalently, a multiplication of the
characteristic function $F(y)$ by $e^{-i\mu'_1 y}$. Since $F(y)$ corresponds to
$A(z)$ and $iy$ to $z^{-1}$, the modified predictor “polynomial” is

$$\tilde{A}(z) = A(z)\, e^{-a_1 z^{-1}} \tag{B.23}$$

or by expanding the exponential

$$\tilde{A}(z) = \sum_{k=0}^{p} a_k z^{-k} \sum_{m=0}^{\infty} \frac{(-a_1)^m}{m!}\, z^{-m} . \tag{B.24}$$

Introducing “modified predictor coefficients” $\tilde{a}_k$ defined by

$$\tilde{A}(z) = \sum_{k=0}^{\infty} \tilde{a}_k z^{-k} , \tag{B.25}$$

one obtains, by multiplying the two sums in (B.24),

$$\tilde{a}_k = \sum_{m=0}^{k} a_m \frac{(-a_1)^{k-m}}{(k-m)!} \quad (a_m = 0 \text{ for } m > p) , \tag{B.26}$$

with $\tilde{a}_0 = 1$ and $\tilde{a}_1 = 0$, as expected.

The saving in computation, once this transformation has been performed,
can be substantial. The number of terms required to express $c_n$ in terms of
the $\tilde{a}_k$ equals $p(n) - p(n-1)$, where $p(n)$ is the number of
unrestricted partitions of n. (Because the degree of $\tilde{A}(z)$ is
unlimited, the partitions become unrestricted and precisely $p(n-1)$ of the
$p(n)$ partitions of n contain the integer 1 corresponding to the vanishing
$\tilde{a}_1$.) For example, for a predictor polynomial $A(z)$ of order
$p \ge 9$, $c_9$ is given by 30 terms in the $a_k$ but requires only eight
terms in the $\tilde{a}_k$. By contrast, the recursive computation of $c_9$
involves 45 terms. However, if all cepstrum coefficients up to n have to be
calculated, the recursive formula becomes advantageous for $n > 13$. (That a
crossover in efficiency occurs for some finite n follows from the fact that the
number of partitions grows exponentially as n goes to infinity whereas the
number of recursive terms grows only polynomially.)
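
The term counts quoted above and the transformation (B.26) can be reproduced
with a few lines of Python. This is only a sketch; the function names are
chosen for this example, and the partition numbers are obtained by a standard
dynamic-programming count.

```python
from math import factorial


def modified_predictor(a, k_max):
    """Modified coefficients of A(z) * exp(-a_1 z^-1), following (B.26).

    a = [1, a_1, ..., a_p] with p >= 1; returns the values for k = 0..k_max.
    """
    p = len(a) - 1
    a1 = a[1]
    out = []
    for k in range(k_max + 1):
        s = 0.0
        for m in range(min(k, p) + 1):    # a_m vanishes for m > p
            s += a[m] * (-a1) ** (k - m) / factorial(k - m)
        out.append(s)
    return out


def partition_counts(n_max):
    """Unrestricted partition numbers p(0), ..., p(n_max)."""
    counts = [1] + [0] * n_max
    for part in range(1, n_max + 1):
        for n in range(part, n_max + 1):
            counts[n] += counts[n - part]
    return counts


if __name__ == "__main__":
    p = partition_counts(9)
    print(p[9], p[9] - p[8])                          # term counts for c_9
    print(modified_predictor([1.0, 0.3, 0.2], 4))     # second entry is zero
```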

The change from $a_k$ to $\tilde{a}_k$ has a very simple effect on the
logarithm of the Fourier transform or “log-spectrum” $L(\omega)$ defined by

$$L(\omega) = \ln[A(e^{i\omega T})] . \tag{B.27}$$

With (B.23) the new log-spectrum is

$$\tilde{L}(\omega) = L(\omega) - a_1 e^{-i\omega T} , \tag{B.28}$$

which is just the original log-spectrum minus its fundamental “quefrency”.

Thus, to compute the cepstrum, one can use (B.8) as before, with the
$\tilde{a}_k$ replacing the $a_k$ and remembering that $c_1 = -a_1$. Since
$\tilde{a}_1 = 0$, most terms drop out, which was the purpose of the
transformation.

Because of Vieta’s theorem and $\tilde{a}_1 = 0$, the modified polynomial
$\tilde{A}(z)$ has a vanishing sum of roots. This (or the fact that
$\tilde{L}(\omega)$ has no fundamental “quefrency”) may be useful for signal
spectrum preemphasis.



B.7 An Application of Root-Power Sums 
to Pitch Detection 

Several years ago, Atal [B.13] proposed a method for pitch detection
(fundamental frequency measurement of a speech signal) that is closely related
to the root-power sums. In this method, one determines the predictor
coefficients $a_k$ for a speech segment (low-pass filtered at 1 kHz, say) of
ca. 40 ms duration. The order of the predictor is relatively high
($p \approx 40$). As a result, for voiced speech sounds, the linear-prediction
spectrum approximates the harmonic fine structure of the spectrum. In other
words, the pole frequencies of the predictor polynomial represent the
fundamental frequency and its harmonics, and not primarily the formant
frequencies as in low-order linear prediction.

If we write

$$z_r = e^{i\omega_r T} , \tag{B.29}$$

where T is the sampling time interval and $\omega_r$ the complex frequency of
the rth root of $A(z)$, then the root-power sums are

$$R_m = \sum_{r=1}^{p} e^{i\omega_r m T} . \tag{B.30}$$

Thus, if the index m is considered as representing time, the $R_m$ are the sums
of p sampled complex exponentials, all being added with equal weight and
zero phase at time zero (corresponding to $m = 0$).

Because the significant terms in (B.30) represent the harmonic frequencies,
the magnitude of $R_M$, if $MT$ equals the pitch period, will be relatively
large. The reason is that for $m = M$ all such terms in (B.30) add in phase
again (the way they started out at $m = 0$). If p exceeds the number of
harmonics, then the excess terms in (B.30) will not necessarily coincide with
any harmonic frequency but such excess terms will also have large imaginary
parts of $\omega_r$ (i.e., will be highly damped) so that they contribute
little to the value of $R_M$. This method has been included in a comparative
study of various pitch detectors [B.14].
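
The in-phase addition at $m = M$ can be illustrated with synthetic roots. The
Python sketch below does not perform linear prediction on speech; it simply
places conjugate root pairs at the harmonics of an assumed pitch period of 20
samples (plus one strongly damped excess root) and picks the lag with the
largest $|R_m|$. All numbers and function names are hypothetical choices for
this example.

```python
import cmath
import math


def harmonic_roots(period, n_harmonics, radius):
    """Conjugate root pairs at the harmonics of a pitch period given in samples."""
    roots = []
    for h in range(1, n_harmonics + 1):
        z = cmath.rect(radius, 2.0 * math.pi * h / period)
        roots += [z, z.conjugate()]
    return roots


def root_power_sums(roots, m_max):
    """R_0, ..., R_{m_max} as in (B.30), with z_r = exp(i * omega_r * T)."""
    return [sum(z ** m for z in roots) for m in range(m_max + 1)]


def pitch_period_from_roots(roots, m_min, m_max):
    """Lag m in [m_min, m_max] at which |R_m| is largest."""
    sums = root_power_sums(roots, m_max)
    return max(range(m_min, m_max + 1), key=lambda m: abs(sums[m]))


if __name__ == "__main__":
    roots = harmonic_roots(period=20, n_harmonics=5, radius=0.995)
    roots.append(cmath.rect(0.3, 1.0))    # a highly damped excess root
    print(pitch_period_from_roots(roots, 5, 30))
```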

There are similarities between pitch detectors based on root-power sums 
and those based on autocorrelation (“matched filter”) analysis. However, an 
important distinction between the two is that in the autocorrelation func- 
tion individual components (“harmonics”) are added with their amplitudes 
squared while in the root-power sums, the amplitudes of all terms are equal. 
Thus, root-power sums act as spectrum flatteners (“inverse filters”), thereby 
avoiding the problems in pitch detection resulting from nonflat spectra due 
to formant structure, vocal source spectrum, lip radiation, etc. [B.15]. 

How do cepstrum pitch detectors fit into this picture? Because of the
identity $c_m = R_m/m$, pitch detectors based on the cepstrum and root-power
sums are very similar. In fact, they are identical except for the weighting
factor m. Both eliminate the formant (and spectrum envelope) structure from
the signal spectrum and thereby enhance the spectrum fine structure that
contains the pitch information. But in the cepstrum the higher “quefrencies”
are attenuated by a factor 1/m. This is not necessarily advantageous. In
fact, it was found that multiplying the cepstrum by the “quefrency” m
often gives better results. It is, therefore, legitimate to say that cepstrum
pitch detectors work well, especially when a quefrency weighting factor m is
included, because of the close connection with root-power sums, which act as
an inverse filter flattening the spectrum and setting the phases of all
components to zero.



A Sampling of Societies and Major Meetings

Eurospeech. European conferences on speech communication and technol- 
ogy, organized by the European Speech Communication Association 
(ESCA). 

ICASSP. IEEE International Conferences on Acoustics, Speech, and Signal 
Processing. 

AES Conventions. Meetings of the Audio Engineering Society. 

ICA. International congresses on acoustics, held every three years, orga- 
nized by the International Commission on Acoustics (United Nations). 

ASA Meetings of the Acoustical Society of America, held twice a year, with 
extensive coverage of speech and hearing. 

ICPhS. International Congresses of Phonetic Sciences. 

ICSLP. International Conferences on Spoken Language Processing. 

Conferences of the International Neural Networks Society. 

DAGA. Annual meetings of the German Acoustical Society (DEGA).

DAD. Danish Acoustical Days, organized by the Danish Acoustical Society
(DAS).

SFA. Société Française d’Acoustique, holds frequent meetings.

IOA. (British) Institute of Acoustics.

EEAA. East-European Acoustical Association. 

EAA. European Acoustics Association. 

ACL. Annual meetings of the Association for Computational Linguistics. 

ELSNET. European Network for Language and Speech. Maintains a WWW 
page containing a list of Speech and Natural Language events with con- 
tact addresses, http://www.elsnet.org/conferences/ 

EUSIPCO. European Signal Processing Conferences. 

ICANN. International Conferences on Applications of Neural Networks. 

IEEE Workshops on Interactive Voice Technology. 

SPECOM. International Workshops “Speech and Computer.” 

KONVENS. Conferences on Natural Language Processing. 

International Workshops on Speech Synthesis. 




Glossary of Speech and Computer Terms 



The beginning of wisdom is the definition of terms. 

Socrates (470?-399 BC)



This is not the end. It is not even the beginning of the end. 

But it is perhaps the end of the beginning. 

Winston Churchill (10 November 1942, after the victory at El Alamein) 



Terms in italics are explained in their respective alphabetical entries. 



ACELP adaptive code-excited linear prediction.

activity model in speech dialogue systems: the information layer that rep- 
resents the functionality of the linked applications. 

A/D analog-to-digital converter. 

adaptive differential pulse code modulation differential pulse code mo- 
dulation in which the quantizing steps adapt to the signal. 

adaptive predictive coding (APC) early name for linear predictive cod- 
ing emphasizing the adaptive nature of the predictor for speech signals 
(as opposed to the fixed predictors used for image coding). 

ADPCM adaptive differential pulse code modulation. 

Advanced Encryption Standard a new standard that uses flexible, larger 
block and key sizes than its predecessor, the DES. 

Advanced Research Project Agency agency of the U.S. Defense depart- 
ment supporting research in speech and other fields. 

AES Advanced Encryption Standard. 

AI artificial intelligence. 

algorithm an explicit, step-by-step program or set of instructions for getting 
the solution to some problem. 

algorithmic computing see programmed computing. 
aliasing generating extraneous frequency components by undersampling a
signal. 

alignment model in the statistical approach to automatic translation, a 
model that describes at what position in the output sentence a generated 
word of the target language is placed. 

allpass filter mechanical or electrical device that transmits all frequencies of 
a signal equally well and therefore does not change its amplitude spectrum. 

allpole transmission medium, such as a filter, that has no zeroes in its trans- 
fer function. 

America Online (AOL) an Internet service provider. 

amplitude range (of a signal) difference between highest and lowest am- 
plitude values of a signal. 

amplitude spectrum magnitude (absolute value) of the Fourier transform, 
also called spectrum. 

analog capable of assuming a continuous range of values (such as the hands 
of a clock) — as opposed to digital. 

analysis-by-synthesis the synthesis of several trial versions of a signal and 
choosing the best match to a given signal. 

anaphora in an anaphora, an expression references information from a pre- 
vious sentence without explicitly reformulating the entire phrase but by 
means of pointers (usually pronouns) such as “his,” “this,” or “there.” 
Example: “Let’s go to the beach. It’s always nice there.” See also reference
resolution.

anechoic room an acoustic space without echoes or reverberation, used for 
acoustic tests. 

angular frequency frequency multiplied by 2π.

anti-causal passive device that produces an output only before an input 
is applied and zero output thereafter. The inverse of an allpass filter is 
anti-causal. 

any key any one of the keys of a (computer) keyboard. Not a special key 
called “Any.” 

APC adaptive predictive coding. 

APCM adaptive pulse code modulation. 

aphasia the loss or impairment of language abilities usually following brain 
damage. 

Apple Macintosh computer. 

applet literally: little application. A small program which may be started 
through an applet viewer or web browser, and which has strictly limited 
access (e.g. read/write restriction on hard disk) to the host system. 

application a computer program designed for a specific task or use, like 
word processing, accounting etc. 

AR autoregression. 

ARMA AR followed by MA: autoregressive analysis combined with moving
average of data.







ARPA Advanced Research Project Agency. 

ARPANET forerunner of the Internet, linking military sites, defense con- 
tractors and universities. After 1983 mostly nonmilitary uses. 

articulation the movements of speech organs involved in producing a 
(speech) sound. 

articulator movable organs (tongue, lips etc.) involved in the production of 
speech sounds. 

articulatory feature property of speech sound such as voicing, nasality, 
bilabial, and place of articulation in the vocal tract. 

artificial intelligence the attempt to program computers to carry out in- 
telligent tasks such as learning, reasoning, recognizing objects, under- 
standing speech, and moving arms and legs. 

ASL American Sign Language, the primary sign language for the deaf in the 
United States. 

ASR automatic speech recognition. 

asynchronous transfer mode a standard that allows the transmission of 
data, voice and video in real time. 

ATM asynchronous transfer mode. 

AT&T American Telephone and Telegraph Company, the former mother
company of the defunct Bell System.

autocorrelation normalized average of signal multiplied by the delayed sig- 
nal. 

automatic speech recognition automatic recognition (usually by com- 
puter) of speech signals for speech-to-text systems. 

autoregression (statistical) linear regression analysis based on prior data 
values. 

back door a secret way to enter a computer that bypasses normal security 
procedures. 

backpropagation through time a popular algorithm to train recurrent 
spatiotemporal neural networks, an extension of the standard backpropa- 
gation algorithm. 

backup copy of a file that is kept in case the original is lost. 

Backus-Naur form a common grammar description formalism, equivalent
to a context-free grammar. 

back(ward) propagation algorithm adjustment of weights in a multilayer
neural network beginning with the output layer and working backward
to the input layer.

bandpass filter a mechanical or electronic device that lets only intermedi- 
ate frequencies pass through and blocks lower and higher frequencies. 

bandwidth the width in frequency of a communication channel or filter 
that, together with the signal-to-noise ratio, characterizes its information 
carrying capacity. 




Bark unit of a frequency scale based on subjective pitch. 1 Bark, named 
after Heinrich Barkhausen (1881-1956), corresponds to a bandwidth of 
about 100 Hz below 600 Hz and about 1/6 of the (center) frequency 
above 600 Hz. The frequency range of normal human hearing (20 Hz to 
20 000 Hz) corresponds to 24 Bark. The Bark scale is linear along the 
basilar membrane in the inner ear, 1 Bark corresponding to 1 mm.

baseband low-frequency components of a signal. For speech, typically, the
frequency components below 2000 Hz.

basilar membrane membrane in the inner ear along which sound waves 
travel. 

baud strictly, symbols per second; often used loosely for bits per second:
rate of information transmission. A modem rated at 56 kbit/s can handle
information up to 56 000 bits per second.

BDI model belief/desire/intention model; describes the cognitive state of
a communicating agent, see p. 88.

Bell (Telephone) Laboratories the research laboratories of AT&T, later
of Lucent Technologies.

bigram see n-gram.

binaural masking level difference ability of human hearing to perceive 
tones that are up to 20 dB weaker than in the corresponding monaural 
situation. See also cocktail-party effect.

bit basic unit of binary information, a simple alternative, such as yes/no,
0/1, on/off etc. (Pun created by J.W. Tukey.)

BMLD binaural masking level difference. 

BNF see Backus-Naur form.

boot to start a computer.

Bronx cheer a loud, spluttering noise made with the lips and tongue to
express contempt.

browser web browser.

bug a defect or imperfection in a machine or computer program.

bundling a marketing strategy to promote weak or new products by shipping
them with a popular, established or essential product.

bus circuit that connects the central processing unit with other devices in a
computer. (From “bus bar” in electrical power engineering: heavy-duty
conductor to distribute electrical currents.)

byte eight bits, corresponding to 2^8 = 256 possibilities, such as the 256
different characters (“letters”) of a computer font. One byte therefore
corresponds to one character. (Another pun, this time based on bite: a
big bit.)

cache cache memory. 

cache memory a portion of memory in which frequently used information 
is duplicated for quick access. 

CAD computer aided design. 




Caltech California Institute of Technology at Pasadena. Home of the Jet 
Propulsion Laboratory. 

causal passive physical device that does not produce an output before an 
input is applied. 

CD compact disc. 

CD-ROM read-only memory on a compact disc. 

CELP code-excited linear prediction.

center clipping setting the smallest values of a signal equal to zero. Center 
clipped speech is difficult to understand. See also peak clipping. 
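As a small illustration of this definition, the following Python sketch zeroes all samples whose magnitude falls below a threshold. The function name center_clip and the threshold parameter are hypothetical; some center clippers additionally subtract the threshold from the surviving samples, which this sketch does not do.

```python
def center_clip(signal, threshold):
    """Set samples with magnitude below `threshold` to zero.

    Samples at or above the threshold are passed through unchanged
    (an assumption; variants that shift the surviving samples exist).
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [x if abs(x) >= threshold else 0.0 for x in signal]
```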

central processing unit main component of a computer that interprets 
and executes program instructions. 

cepstrum Fourier transform of the logarithm of the spectrum of a signal. 

CFG context-free grammar. 

chart parser a bottom-up parser, see p. 93. 

chatterbot also chatbot or bot, software that is able to engage in dialogue,
usually by mimicking the mechanics of human conversation, see p. 87.

chord a combination of usually three or more musical tones sounded simul- 
taneously. 

coarticulation the change in phoneme articulation caused by the effect of 
neighboring sounds. 

cochlea spiral-shaped structure in the inner ear where frequency discrimi- 
nation (“Fourier analysis”) and transduction from sound wave to nerve 
impulses take place. 

cochlear implant microelectrodes, implanted in the cochlea, that deliver 
electrical stimuli to the auditory nerve to alleviate sensorineural deafness. 

cocktail-party effect binaural ability of human listeners to suppress unwanted
sounds (such as the speech babble during a noisy cocktail party)
and concentrate on a single voice.

code-excited linear prediction (CELP) linear prediction coder in which
the excitation function for synthesis is derived from a pre-existing
codebook.

coding representation of data, usually in digital form, for purposes of data 
compression, encryption etc. 

comb filter electrical or mechanical filter with periodically spaced trans- 
mission peaks. 

combination tone a tone perceived but not physically present in an auditory
stimulus, such as the difference tones f2 - f1 and 2f1 - f2 resulting
from nonlinear (quadratic and cubic, respectively) distortion in the middle
or inner ear.

compact disc optical recording medium, read out by a laser beam. 

computer simulation mimicking of real-world process (such as flying an 
airplane) on a computer. 

Chomsky hierarchy a ranking of formal languages with respect to their
generative capacity, see p. 72.

chunk parser a robust parsing algorithm that is made up of several
(nonrecursive) layers of syntactic analysis which group (i.e. chunk) the
observed elements into higher-level symbols. In these layers, only those
regions that can be recognized are investigated; other elements are simply
left unchanged. See p. 92.

compiler a computer program that parses and translates a program written
in a higher-level programming language into machine language.

consonance correspondence of sounds; harmony of sounds. In music: a
simultaneous combination of tones conventionally accepted as being in a
state of repose. See also dissonance.

consonant (of speech) a phoneme produced by diverting (m, n, ng),
obstructing (f, v, z etc.), or occluding (p, b, t, d, k, g) the flow of air
in the vocal tract, as opposed to vowel.

constant-Q of a set of resonances (such as the formants of a speech signal):
all having the same Q or reciprocal relative bandwidth.

context-free grammar a formal grammar whose rewrite rules contain only
one symbol on the left side, i.e., are context-independent. The formalism
of choice for programming languages. See p. 73.

context-sensitive grammar as opposed to context-free grammars, this
structure makes it possible to express dependencies between the symbols
that are replaced in a rewrite rule and adjacent elements, see p. 73.

continuity effect the appearance of continuity of an interrupted visual or
auditory stimulus.

convolution integral of a function multiplied by delayed, time-inverted ver- 
sion of another function. 

corpus in natural language processing, a collection of texts that serves to
investigate the statistics of a language, see p. 78.

corpus-based processing see statistical processing.

CPU central processing unit. 

cross-correlation normalized average of signal multiplied by another signal.

cyberspace environment created by virtual reality.

D/A digital-to-analog converter. 

daemon an automatic utility program that runs in the background of a 
computer. 

DARPA Defense Advanced Research Projects Agency.

DARPA Communicator a major dialogue system project initiated by 
DARPA, providing a framework for systems working in the travel plan- 
ning domain. 

data-base management software for storing, manipulating and accessing 
large amounts of data. 

data compression coding of data in a more efficient manner.

Data Encryption Standard (DES) a popular standard that breaks the
data into 64-bit blocks and uses a 56-bit key to encrypt messages. Soon
to be replaced by AES.

dB decibel.

DCT discrete cosine transform. 

decibel ten times the logarithm to the base 10 of the ratio of two intensities
or powers. For an intensity ratio equal to 2, the decibel difference is about
3 decibels; for an amplitude ratio equal to 2, it is about 6 decibels.
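The arithmetic can be checked with a short Python sketch. The function names are hypothetical. The factor 20 for amplitudes follows from intensity being proportional to the square of the amplitude, so a doubling of intensity gives about 3 dB while a doubling of amplitude gives about 6 dB.

```python
import math


def intensity_ratio_to_db(ratio):
    """Decibel difference for a ratio of two intensities or powers."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 10.0 * math.log10(ratio)


def amplitude_ratio_to_db(ratio):
    """Decibel difference for a ratio of two amplitudes.

    Intensity is proportional to amplitude squared, hence the factor 20.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 20.0 * math.log10(ratio)
```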

delta modulation coding of a signal by positive and negative pulses
representing the sign of the difference between a current signal value and its
expectation.
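A minimal sketch of such a coder in Python is given below. It assumes a fixed step size and takes the expected value to be the decoder's running estimate, starting at zero; the function names and the step parameter are hypothetical, and practical coders often adapt the step size.

```python
def delta_modulate(signal, step=0.1):
    """Encode `signal` as a list of +1/-1 pulses.

    Assumptions: the expectation is the running estimate that the
    decoder will also form, and a zero difference is coded as +1.
    """
    estimate = 0.0
    pulses = []
    for x in signal:
        pulse = 1 if x >= estimate else -1
        pulses.append(pulse)
        estimate += pulse * step  # track the signal as the decoder would
    return pulses


def delta_demodulate(pulses, step=0.1):
    """Rebuild a staircase approximation of the signal from the pulses."""
    estimate = 0.0
    output = []
    for pulse in pulses:
        estimate += pulse * step
        output.append(estimate)
    return output
```

Each sample costs exactly one bit; the price is a staircase approximation that hunts around a constant input and lags behind steep slopes.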

demisyllable part of a syllable obtained by cutting it in the middle of its 
steady (vowel) part. 

dependency grammar a grammar that expresses semantic/syntactic rela- 
tions in a sentence as dependencies between words and hence does not 
require an abstract phrase-symbol superstructure. See p. 86. 

DES Data Encryption Standard. 

desktop a display that arranges icons and menus to make the screen look 
like the top of a desk. Popularized by the Apple Macintosh and then by 
Microsoft Windows. 

DFT discrete Fourier transform.

differential pulse code modulation (DPCM) pulse code modulation
applied to signal differences. Akin to delta modulation.

digital having only discrete values (such as the displayed numbers on a cash
register).

digital certificate digital encryption method that guarantees the legiti- 
macy of the transmitted information. 

digital signatures digital encryption method that guarantees the signature
under a letter, order, or contract to be authentic.

digital simulation computer simulation.

diphones vowel plus postvocalic transition.

diphthong gliding speech sound, such as ai in my, oi in boy, au in how, ou
as in low. In English many vowels are diphthongized that are pronounced
as pure vowels in Italian, Hungarian, German and other languages.

disambiguation in language processing, the resolution of ambiguities, e.g.
which meaning of a word is used in a particular sentence, see p. 99.

discourse knowledge the contextual information of a conversation; determines
how the interpretation of a sentence depends on previous utterances.
See p. 70.

discrete cosine transform (DCT) Fourier-like transform based on cosine 
functions. 

discrete Fourier transform Fourier transform for time-discrete (“sampled”)
data.

Disk Operating System (DOS) venerable computer operating system to
run programs.

dissonance inharmonious or harsh sound. In music: a simultaneous combination
of tones conventionally accepted as being in a state of unrest and
needing completion.

distinctive feature crucial distinguishing mark (voicing, nasality etc.) be- 
tween two phonemes. 

dongle piece of hardware that must be attached to a computer to make 
certain software work. A dongle, also known as hardware key, prevents 
illegal access to software. (The origin of dongle, a neologism, is uncertain.) 

DOS Disk Operating System. 

download to receive a file from another computer via a modem. 

DPCM differential pulse code modulation.

DRAM Dynamic Random Access Memory. 

driver a piece of software that controls a hardware device. Usually, drivers 
are written by the hardware manufacturer and then integrated into the 
operating system. 

DSP digital signal processing or processor, often realized by an integrated 
circuit. 

DVD Digital Versatile (or Video) Disk, resembles a compact disc. New high- 
density standard for optically recording images, music, and other data on 
a disk. 

dynamic programming an algorithm for finding the “best” path through 
a grid of data. 

Dynamic Random Access Memory (DRAM) a RAM chip that stores
information in small capacitors. DRAMs have high storage capacity due
to their simple, small design, but the information must be refreshed
periodically (approximately every 2 ms) as stored charge tends to leak.

dynamic time warping see time warping. 

dyslexia difficulty in reading, often caused by brain damage or inherited 
factors. 

e-lancer free (unaffiliated employed) agent who is electronically linked. (A
rhyming play on free-lancer, originally a medieval mercenary soldier.) 

electronic of or pertaining to processes involving electrons. 

electronic commerce web sites that generate revenue through online sales 
of products or services. 

electronic mail messages sent via the Internet between computers. 

Eliza one of the first and most famous chatterbots. 

ellipsis the omission of a word or phrase necessary for a complete syntactical 
construction but not required for understanding, see p. 95. 

e-mail electronic mail. 

e-mail software software for sending and receiving electronic messages over 
the Internet. 

EM algorithm a powerful optimization method, used e.g. for clustering or
to find the optimal set of parameter values for a hidden Markov model.
The iteration revolves around two steps: Expectation - assuming the
model (distribution) parameters are known, compute the expectation of
the data in this model. Maximization - assuming the proper expectation
values are known, find the maximum-likelihood model parameters, see
p. 82.

e-money electronic money transmitted by the Internet. Similar formations: 
e-bucks, e-credit, e-commerce, e-tailing (a play on retailing). 

emulate imitate software by another type of software (the “emulator” or
“emulation program”).

encryption process for making data inaccessible for unauthorized users. 
Modern encryption methods are typically based on number-theoretic al- 
gorithms, such as exponentiation in finite fields. 

entropy measure of randomness and the amount of information given by an 
observable. 
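As an illustration, the following Python sketch computes the entropy of a discrete probability distribution in bits. It assumes the standard Shannon definition, H = -sum of p log2 p, which this entry does not spell out; the function name is hypothetical.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (p * log p tends to 0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

A fair coin yields 1 bit, four equally likely outcomes yield 2 bits, and a certain outcome yields 0 bits.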

entropy coding source coding based on probability distribution of the 
source symbol. 

envelope curve tangent to each member of a set of curves. Also: curve con- 
necting the peaks of a waveform. 

error signal usually: prediction residual. 

Ethernet networking standard (cabling, hardware and protocol) for building
local area networks.

Exclusive Or (XOR) one or the other but not both. 

expert system an information processing system that is knowledge based
and uses programmed computing.

fast Fourier transform (FFT) algorithm that reduces the computing
time of an N-point Fourier transform by a factor up to N/(2 log2 N). For
N = 2^10 = 1024 the reduction factor exceeds 50.
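The quoted reduction factor, N divided by twice the base-2 logarithm of N, can be evaluated with a few lines of Python. The function name is hypothetical, and the restriction to powers of two is an assumption matching the simplest (radix-2) form of the algorithm.

```python
import math


def fft_reduction_factor(n):
    """Approximate reduction factor N / (2 log2 N) for an N-point transform."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n / (2 * math.log2(n))
```

For N = 1024 this gives 1024 / 20 = 51.2, in line with a factor exceeding 50.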

FAQ frequently asked question. 

fax facsimile. To transmit a facsimile (of printed text, photographs, or the 
like) electronically. Originally used by Interpol and other police agen- 
cies for the distribution of photographs (“mug shots”) of criminals. Long 
ignored for nonforensic applications. 

feature structure word properties that indicate the conditions in which a
word can be used, e.g. number or person of a verb (“does” → 3rd person,
singular). See p. 75.

fertility model in the statistical approach to automatic translation, a model 
that describes how many words in the target language are created by a 
specific word in the original language. 

FFT fast Fourier transform. 

file transfer protocol (ftp) a protocol for transferring files between
computers on the Internet; also a type of Internet site for file downloading.

filter mechanical or electrical device with input and output terminals that
changes the amplitude spectrum and/or the phase spectrum of a signal
applied to its input.

finite (number) field mathematical structure having a finite number of
members that permits adding, subtracting, multiplying and dividing. 
Also called Galois field. 

finite state automaton also called finite state machine or transition network,
a simple logical structure consisting of states and arcs that are
traversed to match a rule, e.g. in a context-free grammar, see p. 73.

finite state transducer an extension of the finite state automaton struc- 
ture that achieves generative capacity by mapping from one set of 
states/symbols to another one. See p. 74. 

FIR finite impulse response (of a filter). FIRs have zeroes in their transfer 
functions. 

firewall defensive software that protects a computer system from unautho- 
rized intruders. 

FIR neural network an implementation of a time-delay neural network
where each synapse (link between neurons) is represented as a linear,
time-invariant (LTI) filter. Thus, the input received from each synapse
can be described as the convolution sum of a finite impulse (the delayed
inputs) and the impulse response of the LTI filter.
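The convolution sum for a single such synapse can be sketched in Python as follows. This shows only the FIR filtering step, not a complete network; the function name is hypothetical, and inputs before the start of the sequence are assumed to be zero.

```python
def fir_filter(inputs, impulse_response):
    """Convolution sum y[n] = sum over k of h[k] * x[n - k].

    Assumption: x[n] = 0 for n < 0, so early outputs use fewer terms.
    """
    outputs = []
    for n in range(len(inputs)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * inputs[n - k]  # delayed input weighted by h[k]
        outputs.append(acc)
    return outputs
```

Feeding in a unit impulse, [1, 0, 0, ...], returns the impulse response itself, which is finite in length by construction.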

floppy floppy disk. 

floppy disk a flexible magnetic disk for storing digital data. 

FM synthesis a simple way to generate sounds by frequency modulation of 
a sine wave. 

formal language as opposed to natural language, a formal language can be 
entirely generated and analyzed by a formal grammar, that is, a grammar 
consisting of a set of word categories, rules to combine these to higher- 
level symbols, and the set of these symbols. See p. 72. 

formant resonance of the vocal tract. From musicology where a formant is 
one of the resonances of a musical instrument. Different formant frequen- 
cies distinguish different vowel sounds of human speech. 

Fourier transform mathematical transformation (of a time signal such as 
speech) that picks out the individual frequency components (“harmon- 
ics,” “overtones”). 

frame in natural language processing, a structure that contains semantic 
information of a dialogue or of an utterance in one or more slots. See p. 
97. 

freeware free software. 

frequency channel a slice of a speech spectrum with a bandwidth between 
100 and 300 Hz. 

frequency diversity method of communication in which the signal is trans- 
mitted over several frequency channels to combat interference from other 
sources. Similar formations: time diversity and space diversity (the use 
of several transmitting or receiving antennas). See also spread spectrum. 

frequency hopping a combination of frequency diversity and time diversity
to reduce interference from other sources occupying the same frequency
band. Frequency hopping is also used to reduce range and velocity ambiguity
in radar. Some optimal hopping schemes are based on number
theory.

frequency warping transformation of the frequency scale to the Bark scale 
to conform to the frequency analysis in the inner ear. 

fricative speech sound with audible friction produced by forcing air through
a constriction in the vocal tract (f, v; s, z; sh, zh; th as in thin, th as in
they).

FSA finite state automaton. 

FST finite state transducer. 

FTP file transfer protocol (ftp).

function word preposition, article, auxiliary or pronoun such as an, the,
and, in, etc., as opposed to content words.

fundamental frequency (f0) for a periodic signal the fundamental frequency
is the greatest common divisor of its harmonic frequencies. The
fundamental frequency of a sound signal (if it exceeds 20 Hz) determines
the perceived pitch, even if it is physically absent. The pitch percept in
the case of the “missing fundamental” is called residue pitch.
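The greatest-common-divisor view can be illustrated with a short Python sketch. It assumes the harmonic frequencies are given as whole numbers of Hz, which real measurements rarely are; the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def fundamental_frequency(harmonics_hz):
    """Greatest common divisor of integer harmonic frequencies, in Hz."""
    if not harmonics_hz:
        raise ValueError("at least one harmonic frequency is required")
    return reduce(gcd, harmonics_hz)
```

Harmonics at 600, 800 and 1000 Hz give a fundamental of 200 Hz even though no component at 200 Hz is present, which is the “missing fundamental” case.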

Furby a furry toy, stuffed with electronics. A purported threat to the
National Security Agency (see p. 45).

fuzzy logic logic based on fuzzy sets. 

fuzzy sets a generalization of a classical set with the property that each 
member of a population of objects has associated with it a number, usu- 
ally from 0 to 1, that indicates the degree to which the object belongs to 
the set. 

f0 fundamental frequency.

Galois field finite field of numbers based on the power of a prime number. 

Gaussian variable random variable with Gaussian ( “normal” ) probability 
distribution such as many hiss-like noises. 

GB gigabyte. 

generative grammar a set of rules that determines the form and meaning 
of words and sentences in a given language. 

GHz one billion Hertz. 

GIF Graphics Interchange Format, an image format based on lossless coding.

Gigabyte one billion (10^9) bytes. Modern computers have hard disks with
typically several gigabytes of memory.

glottal wave air flow emanating from the vocal cords. For voiced sounds 
the glottal wave consists of quasiperiodic puffs of air. 

grammar a language’s morphosyntax, i.e. the rules of morphology and
syntax, see p. 71. See also formal language and natural language.

Graphical User Interface (GUI) computer operating system that uses
icons and symbols to launch and operate programs.

Grice’s maxims the rules of a cooperative dialogue as summarized by
Grice: Information in a conversation should be given in the appropri- 
ate quantity, quality, be related to the context, and given in the adequate 
manner. See p. 88. 

group delay delay of the envelope of a group of frequencies. 

GUI Graphical User Interface. 

hacker person who intrudes into a computer system; also someone who writes
programs of a somewhat routine nature.

Hadamard transform binary transform based on orthonormal Hadamard
matrices. Hadamard matrices of order 2^n, like the fast Fourier transform,
permit a fast algorithm.

Hamming distance number of bits that are different in two binary code 
words. 
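A minimal Python sketch of this count, assuming the code words are given as equal-length strings of 0s and 1s (the function name is hypothetical):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length code words differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

For example, 10110 and 10011 differ in the third and fifth positions, so their distance is 2.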

Hamming window function of time (or frequency or space) which, used as
a window function, causes a low amount of spectral splatter.

hands-free telephone speakerphone.

handshake initial exchange of information between two modems to establish
an electronic link on the Internet or between two fax machines.

hard disk a digital mass storage device consisting of one or more rigid
magnetic disks rotating at high speed.

hard limiter nonlinear electronic device that reduces all input values to one 
of two fixed output levels. 

hardware physical (computer) devices (as opposed to software).

harmonic pertaining to, or denoting a series of oscillations in which each
oscillation has a frequency that is an integral multiple of the same basic
or fundamental frequency.

hash algorithm in encryption: method that arranges fixed-length pieces of
a message into blocks before encryption and yields a distinct output (the
digest, or “hash”). Used as a digital fingerprint to detect forgeries.

head in natural language processing, the word or subphrase that determines
the syntactic function of a phrase, see p. 75.

head-lexicalized grammar a grammar that expresses syntactical relations
between phrases as a function of the phrases’ heads, see p. 75.

Heisenberg uncertainty product of standard deviations of energy distributions
in time and frequency (or any other Fourier conjugate variables).
In quantum mechanics conjugate variables are position and momentum,
angle and angular momentum, energy and time, etc.

hidden Markov model (HMM) statistical model that describes input-output
relations of sequentially occurring signals (such as phonemes in a
speech sample) using internal, “hidden” states and transition probabilities
between them. One of the most powerful tools in speech recognition.

highpass filter a mechanical or electronic device that lets only high
frequencies pass through and blocks out low frequencies.

Hilbert envelope envelope obtained with the help of the Hilbert transform.

Hilbert transform integral transform corresponding to a 90°-phase shift 
in the Fourier transform (“frequency domain”). 

HMM hidden Markov model (for automatic speech recognition). 

homepage the web site of a person, institution, company or other entity 
rather than a site dedicated to an abstract topic. 

homomorphic filtering nonlinear filtering of signals utilizing the complex 
cepstrum. 

HTML hypertext markup language. 

HTTP hypertext transfer protocol. 

hyperlink a technology that lets users jump from one item to another by 
clicking with a mouse on a word or icon that points to some other part 
of the network. 

hypertext a computer text document that is connected to others through 
hyperlinks. 

Hz Hertz, formerly “cycles per second,” measure of frequency or bandwidth. 

IBM International Business Machines Corporation; introduced its first
personal computer (PC) in 1981.

IC integrated circuit. 

icon a small image that represents a file, program or location on the Internet. 

IIR infinite impulse response (of a filter). Minimum-phase or allpole filters
have IIRs and stable inverses.

illocutionary act see speech act. 

infinite clipping setting all positive values of a signal equal to +1 and
all negative values equal to -1. Infinite clipping leaves a speech signal
moderately intelligible. See also hard limiter.
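In Python, the operation amounts to keeping only the sign of each sample. The sketch below uses a hypothetical function name; since the definition does not say what happens to samples that are exactly zero, leaving them at zero is an assumption.

```python
def infinite_clip(signal):
    """Map positive samples to +1.0 and negative samples to -1.0.

    Assumption: samples that are exactly zero are left at 0.0.
    """
    return [1.0 if x > 0 else -1.0 if x < 0 else 0.0 for x in signal]
```

Only the zero crossings of the waveform survive this operation; all amplitude information is discarded.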

initiative in dialogue systems, a dialogue turn can be initiated by the
speaker (user initiative), the machine (system initiative), or both (mixed
initiative).

integrated circuit (IC) solid-state circuit consisting of interconnected 
semiconductor devices like transistors, capacitors and resistors that form 
a logical unit on a small chip. 

Intel manufacturer of microchips. 

interactive (with regard to computers) interacting with a human user to 
obtain data or commands and to give immediate results. 

Internet a decentralized collection of networks that connects dissimilar
computers around the world and allows them to send and receive data by
following a set of global communications rules. The Internet is the platform
for e-mail, the World Wide Web, file transfers and chat programs,
among other technologies.

Internet Explorer web browser by Microsoft. 

Internet protocol (IP) a low-level convention that allows computers to
move packets of data across the Internet.

Internet service provider (ISP) organization such as America Online (AOL)
that allows you to connect your computer to the Internet.

intonation the melody or pitch contour of speech. 

inverse filter filter with a transfer function that is the reciprocal of that of 
a given filter. Inverse filtering of a speech signal with the inverse of the 
vocal tract transfer function produces the glottal wave. 

I/O input/output.

IP Internet protocol. 

IPA the international phonetic alphabet, see phonetic alphabet. 

IRCAM Institut de Recherche et de Coordination Acoustique Musique, a 
department of the Centre Pompidou, Paris, established by Pierre Boulez 
to foster modern music research. Originally conceived as a Max Planck 
Institute, it was rejected - in no uncertain terms - by Werner Heisenberg 
who could not see modern music as a proper concern of the august Ger- 
man body. (IRCAM was briefly called IRAM before the C was inserted to 
avoid - given the singularities of French pronunciation - confusion with 
Iran.) 

is third person singular of the present tense indicative of the verb “to be.” 

island parser a robust parser that only processes words and phrases in 
regions of confidence and ignores other elements. 

ISP Internet service provider. 

JAVA a computer language developed by Sun Microsystems that produces 
programs that run on almost any computer or operating system. Its com- 
patibility and ease of use make it an increasingly popular language for 
developing applets, tiny applications that can be sent quickly over the 
World Wide Web. 

JPEG Joint Photographic Experts Group, which sets the standards for image
coding and transmission over the Internet.

kHz one thousand Hertz. 

k-means a popular clustering algorithm, see p. 83.

LAN local area network. 

language model a model that can analyze and/or generate the word strings 
of a language. An important use is to assess whether a string is a proper 
sentence in a specific language and how likely it is. 

larynx the valve at the top of the windpipe. 

LaTeX a set of high-level commands that allows one to take advantage of
TeX’s text formatting capabilities in a more comfortable way.

lemmatization a process that maps a word to its root word form, or rather,
to its lemma, an abstract class of all different forms a word can take, e.g.
does, did, doing → <do lemma>.

lexicon the words of a language and their respective parts of speech; also
the inventory of a language’s phrase types and morphemes.

limiter see hard limiter.

linear predictive coding (LPC) predicting a present value of a signal (a
speech signal, for example) by linearly combining past values.

linguistics the study of language, including phonetics, phonology, morphology,
syntax, and semantics. See Sect. 4.1.

links (not the German left.) Connections between hypertext documents,
soundfiles, software etc. that can be activated by clicking on a symbol
like an icon or highlighted text. These links allow one to connect hypertexts
with related topics by a mouse click, even when the systems where
the documents are stored are thousands of miles apart.

Linux simplified clone of Unix. 

liquid frictionless speech sound, with partly obstructed air flow from the
lungs, that can be held steady like a vowel (especially l and r).

local area network (LAN) a small computer network, e.g. in an office
building.

logical form a scheme for the representation of semantics, see p. 76.

Lombard effect the change of vocal characteristics employed by a speaker
to cope with noisy environments.

lossless coding method of coding that permits the complete reconstruction
of the original data. (Lossless coding of English text can save about half
the necessary bits.)

lowpass filter a mechanical or electronic device that lets only low frequen- 
cies pass through and blocks high frequencies. 

LPC linear predictive coding. 

Lucent Technologies research and manufacturing company, including Bell
Laboratories, split off from AT&T in 1996.

MA moving average: process for smoothing data. 
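A minimal Python sketch of such smoothing follows. It assumes equal weights and a sliding window that must fit entirely inside the data; the function name and the window parameter are hypothetical.

```python
def moving_average(data, window):
    """Equal-weight moving average over a sliding window of `window` samples."""
    if window < 1 or window > len(data):
        raise ValueError("window must be between 1 and len(data)")
    return [
        sum(data[i:i + window]) / window
        for i in range(len(data) - window + 1)
    ]
```

The output is shorter than the input by window - 1 samples, because no average is formed where the window would overhang the ends.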

Ma Bell Ma as in Mama: the Bell Telephone System consisting (before the
breakup of AT&T in 1984) of AT&T, Western Electric, Bell (Telephone)
Laboratories and 23 “operating companies.”

Mac OS the operating system that runs the Apple Macintosh computer.

mainframe computer large computer, often the hub of a system serving
many users.

masked threshold threshold of hearing in the presence of a masking noise.

masking in hearing: a strong sound (the “masker”) making a “weaker”
sound inaudible.

matched filter mechanical or electrical filter with transfer function that 
is the complex conjugate of the Fourier transform of the signal to be 
detected. 

MB megabyte.

megabyte one million bytes. Top laptops have typically 256 to 1024 megabytes
of random access memory.

mel subjective scale characterizing the “tone height” or pitch of a sound,
similar to the Bark scale.

memory organ to forget with.

MHz one million Hertz. 

microchip a semiconductor device that serves as an integrated circuit.

microcomputer compact computer with lower capabilities than a
minicomputer.

microprocessor integrated circuit (e.g. in a computer or appliance) etched 
on layers of silicon that organizes the central electronics of a computer 
on a chip. 

Microsoft software developer, best known for its Windows operating sys- 
tems. 

MIDI Musical Instruments Digital Interface.

millennium bug year 2000 problem or year 2000 bug.

minicomputer computer with processing capabilities smaller than those of 
a mainframe computer. 

minimum description length method a clustering algorithm that parti- 
tions the data so that the code length required to describe both data and 
clusters is minimal, see p. 84. 

MIT Massachusetts Institute of Technology.

mixed initiative see initiative.

modem modulator-demodulator: device for converting digital data to analog 
data and vice versa. 

modulation transfer function factor with which different modulation fre- 
quencies of a signal are multiplied. 

module part of a computer program that performs a distinct function, or
an interchangeable, plug-in hardware unit.

monitor device with a screen for viewing data at a computer terminal.

morphemes the smallest meaningful pieces into which words can be cut.

morphology the study of word formation in a language.

morphotactics the laws of morpheme combination in a language, e.g. rules
of the kind “To build the present tense, third person form of a verb, attach
an ‘s’ behind the stem.”

mouse a palm-size computer input device that is moved on a flat surface 
to change the position of the cursor on the screen, to open menus, enter 
data etc. 

MPEG Moving Picture Experts Group, which sets standards for picture coding
for the Internet. The latest standard, MPEG-4, includes motion compensation
and object recognition.

MS-DOS Microsoft disk operating system. 

multimedia integration of still images, videos and sound (music and speech). 
multimedia software software that enables audio and video content on
computers. 

multimedia publishing computer software that combines text with video, 
animated graphics and sound. 

multimodal using multiple modalities, for instance a dialogue system that 
communicates visually and with speech output. 

multi-tasking a system that concurrently carries out multiple tasks, e.g. a 
car dialogue system that handles a navigation request, and, while doing 
so, gives information about the engine state. 

multi-threading in speech dialogue system terminology, an application 
that can handle more than one line of communication. 

Musical Instruments Digital Interface (MIDI) serial bus to connect 
electronic musical instruments; file standard which stores musical infor- 
mation efficiently as codebook entries. 

mutual information describes how much more information is contained in 
the sum of separate variables than in the combined distribution of all 
variables, see p. 83. 
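The definition above can be sketched numerically as I(X;Y) = H(X) + H(Y) - H(X,Y), the entropies of the separate variables minus the entropy of their joint distribution. The function names and the two small joint tables below are illustrative assumptions, not taken from the text.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def mutual_information(joint):
    """joint[i][j] is P(X=i, Y=j); returns H(X) + H(Y) - H(X, Y) in bits."""
    p_x = [sum(row) for row in joint]            # marginal of X
    p_y = [sum(col) for col in zip(*joint)]      # marginal of Y
    p_xy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

# Hypothetical example tables for two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # X tells nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]         # X always equals Y

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

For the independent table the result is 0 bits; for the identical table it is 1 bit, the full entropy of either variable.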

nasal speech sound emanating partly (as in French nasal vowels) or entirely 
through the nose (m, n, and ng as in sing). 

National Security Agency (NSA) government agency, headquartered in
Ft Meade, Maryland, charged with the design and breaking of secrecy 
codes. 

natural language a language spoken by humans, as opposed to a formal 
language. 

net the Internet. 

netiquette rules for good behavior on the Internet. 

Netscape Navigator web browser by Netscape. 

neural net(work) mathematical model simulating the behavior of biological
neural networks for pattern recognition, speech processing and
self-directed problem solving.

neurocomputing an alternative to programmed computing. An approach 
to develop information processing capabilities for tasks where the algo- 
rithms or rules are not known or cannot be implemented. This is achieved 
with parallel, distributed, adaptive information processing systems such 
as neural networks, genetic or fuzzy learning systems, and learning au- 
tomata. 

n-gram a structure and method to obtain language statistics by moving a
window of n words over a text corpus and recording how often a particular
tuple of n words was encountered. N-grams, especially trigrams and
bigrams, commonly serve as language models. See p. 80.
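A minimal sketch of the sliding-window count described above. The toy corpus and the helper names are assumptions made for illustration; a real language model would use a far larger corpus and some form of smoothing.

```python
from collections import Counter

def ngram_counts(words, n):
    """Slide a window of n words over the list and count each tuple."""
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(windows)

# Hypothetical toy corpus.
corpus = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

def bigram_probability(first, second):
    """Relative-frequency estimate of P(second | first) from the bigram counts."""
    total = sum(count for (w1, _), count in bigrams.items() if w1 == first)
    return bigrams[(first, second)] / total if total else 0.0

p_cat_given_the = bigram_probability("the", "cat")
```

Here "the" is followed by "cat" twice and by "mat" once, so the estimate of P(cat | the) is 2/3.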

NLP Natural Language Processing. 

NLU Natural Language Understanding. 
nonstandard computing computing, usually in a highly parallel mode,
making use of molecules (e.g. DNA and RNA), cellular automata, or 
quantum mechanical states. 

NSA National Security Agency.

NSFnet high-speed backbone network established by the National Science 
Foundation (1987-1995). 

Nyquist rate the smallest possible sampling rate that avoids aliasing.

Nyquist theorem see sampling theorem.

Office Suite a collection of software applications that can share data.

offline operating independently of an associated computer (as opposed to
online).

one-time pad encryption method that uses a key only once and then discards
it for better protection against decryption.

online operating under direct control of a main computer (as opposed to
offline).

online publishing publishing on the Internet. 

online service a business that provides dial-up access to information,
entertainment, e-mail and chat groups, among other features.

ontology in natural language processing, the set of different classes of
(semantic) objects, the representation of a domain. See p. 96.

operating system the software that allows users and application programs
to interact with and control a computer or microprocessor and its
peripheral devices. Examples include Mac OS, Windows, and Unix.

orthonormal system of functions or sequences that are orthogonal
(“linearly independent”) to each other and normalized to have unit energy.

OS computer operating system.

oversampling sampling at a rate above the Nyquist rate. 
overtone an acoustical frequency that is higher in frequency than the fun- 
damental. 

packet a package of data that travels together on the Internet. 
parallel bus computer bus that transmits several bits simultaneously (in
parallel) as opposed to a serial bus. 

parsing in natural language processing, the analysis of sentence structure, 
that is, parts of speech, phrases, and their syntactic relations. See p. 92. 
partial masking in hearing: a strong sound reducing the loudness of an- 
other, usually weaker, sound. 

partial tone one of the pure tones forming a part of a complex tone. Also 
called partial. 

part of speech a word category such as verb, noun, pronoun, article, ad- 
jective, adverb, etc. 

PC personal computer.

PCFG probabilistic context-free grammar. 
PCM pulse code modulation.

peak clipping limiting the amplitude range of a signal. See also infinite 
clipping. 

peak factor ratio of the amplitude range of a signal to its root-mean-square
value. For a sine wave the peak factor equals 2√2 ≈ 2.8.
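The sine-wave figure can be checked numerically. The sketch below takes "amplitude range" to mean the difference between the largest and smallest sample, which is the reading that yields 2√2; the function name and the sampled test signals are illustrative assumptions.

```python
from math import sin, pi, sqrt

def peak_factor(samples):
    """Amplitude range (max minus min) divided by the root-mean-square value."""
    amplitude_range = max(samples) - min(samples)
    rms = sqrt(sum(x * x for x in samples) / len(samples))
    return amplitude_range / rms

# One full period of a unit-amplitude sine wave, 1000 samples.
n = 1000
sine = [sin(2 * pi * k / n) for k in range(n)]
sine_peak_factor = peak_factor(sine)

# A square wave alternating between +1 and -1, for comparison.
square = [1.0, -1.0] * 500
square_peak_factor = peak_factor(square)
```

The sine wave gives 2√2 ≈ 2.83, while the square wave gives 2, the smallest value a signal symmetric about zero can have.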

peak value highest value of a signal.

pel picture element (before 1970), now called pixel.

perceptron historically, a single-layer neural network. The perceptron is
incapable of executing the Exclusive Or or XOR function.

personal computer desktop or laptop computer compatible with IBM
computers (as opposed to Apple Macintosh).

PGP Pretty Good Privacy. 

phase delay phase shift, expressed as a time. 

phase spectrum phase angles as a function of frequency (as in the Fourier 
transform of a signal). 

phon unit of loudness level of a sound, obtained by comparison with a 1-kHz 
tone. 

phone speech sound. 

phoneme any of the 15 to 70 distinctive speech sounds of a language.

phonetic pertaining to the production and transcription of speech sounds.

phonetic alphabet a set of symbols used for phonetic transcription such
as the international phonetic alphabet, see p. 68.

phonetics the study of speech sounds, understood as a signal, see p. 68. See
also phonology.

phonetic spelling Dutch spelling is largely phonetic. English is decidedly 
not. (Think of the “spelling” of fish as ghoti: gh as in enough, o as in 
women, ti as in nation.) 

phonology the study of speech sounds, investigated as a language
phenomenon, see p. 68. See also phonetics.

phonotopic map mapping from a speech signal or its spectrum to a
two-dimensional space in which adjacent formant frequencies are adjacent.

photonic of or pertaining to processes involving photons.

phrase a group of words that behaves as a unit in a sentence and has some
coherent meaning.

pitch subjective frequency of a tone. 

pixel smallest element of an image that can be individually processed and 
displayed. 

place of articulation location in the vocal tract at which two speech organs 
(such as tongue tip and teeth or tongue body and palate) approach each 
other or come together. 

plain old telephone service (POTS) service of the kind that old Ma Bell 
provided. 

platform a fundamental layer of software required to make other programs 
run. The word is used interchangeably with operating system, which is
the most common type of platform. The Internet is another, and local
networks, web browsers and Java are all frequently viewed as platforms.

plosive stop consonant characterized by sudden air pressure release (p, t,
d, k, g).

point of articulation place of articulation. 
pole a resonance in a signal or transfer function. 

PONS Prometheus OrthoNormal System, a binary coding scheme that
minimizes Heisenberg uncertainty. 
port a connection, or channel, into a computer. 

PostScript a popular, flexible printing and plotting language for ready-to- 
print files that allows electronic file transfer to other institutions. 

POTS plain old telephone service. 

power spectrum squared magnitude of the Fourier transform. 
pragmatics the study of those principles that form our understanding of why
certain sentences are anomalous, or not possible utterances in a given
context. Also, more generally, pragmatics is often understood as meaning
interpretation using semantics, context, and world knowledge, see p. 70.

prediction error prediction residual.

prediction residual remaining error in a predictive analysis system such 
as linear predictive coding. 

predictive coding predicting a present value of a signal (a speech signal,
for example) from its past values.

predictor coefficient coefficient in a predictor polynomial.
predictor polynomial polynomial that predicts a present signal value from 
its past values. 

presentation software software for creating business presentations on com- 
puter screens. 

pretty good privacy a simplified version of a fully secure encryption sys- 
tem. 

probabilistic context-free grammar a stochastic version of the context- 
free grammar scheme that assigns probabilities to its rewrite rules. See 
p. 84. 

programmed computing problem-solving by devising an algorithm and/or
a set of rules and then coding these in software. So far the most
common software design approach. Less flexible than neurocomputing.

prosody the stress and intonation patterns of an utterance.

protocol rules and standards for information transfer between computers.

protolanguage an ancestor of today’s natural languages in which strings
of semantic chunks were communicated without syntax. More generally,
protolanguage is often understood as a language that is the observed or
hypothetical progenitor of another language or group of languages, also
called Ursprache.

PSOLA a concatenative speech synthesis method. Segments of recorded
speech are adjoined through Pitch-Synchronous OverLap-Add, see p. 105.




psychoacoustics the study of sound perception, a subfield of psychophysics. 

psychophysics the branch of psychology that describes the relation between 
physical stimuli and the resulting sensations. 

public key cryptosystems system for encrypting data, using generally ac- 
cessible (“public”) keys and mathematical functions that are easy to 
execute in one direction (such as multiplying) but very difficult in the 
opposite direction (factoring). 

pulsation threshold the level at which an interrupted stimulus (a tone or 
speech) in the presence of an alternating noise sounds continuous as a 
result of the auditory continuity effect. 

pulse code modulation (PCM) replacing an analog signal by a sequence 
of discrete or digital values. 

Q also Q-factor: the resonance frequency of a resonance (such as a formant 
of speech) divided by its bandwidth. 

quantizing converting an analog value into a discrete or digital one. 

quantizing noise signal residues remaining as imprecisions when converting
an analog signal to discrete (digital) values. The higher the time
resolution (sampling rate) and amplitude resolution (wordlength) the better
the signal-to-noise ratio.
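The effect of wordlength on the signal-to-noise ratio can be illustrated with a small simulation. This is a sketch under stated assumptions: a uniform quantizer over the range -1 to 1, a full-scale sine test tone, and an arbitrarily chosen tone frequency; none of these choices come from the text. Under these assumptions each additional bit is usually expected to gain roughly 6 dB.

```python
from math import sin, pi, log10

def quantize(value, bits):
    """Round a value in [-1, 1] to the nearest multiple of the quantizer step."""
    step = 2.0 / (2 ** bits)
    return step * round(value / step)

def snr_db(bits, count=10000):
    """Measured signal-to-quantizing-noise ratio in dB for a full-scale sine."""
    # Tone frequency (in cycles per sample) picked arbitrarily so that the
    # samples do not repeat within the test signal.
    signal = [sin(2 * pi * 0.1234567 * k) for k in range(count)]
    noise = [x - quantize(x, bits) for x in signal]
    signal_power = sum(x * x for x in signal) / count
    noise_power = sum(e * e for e in noise) / count
    return 10 * log10(signal_power / noise_power)

snr_by_wordlength = {bits: snr_db(bits) for bits in (8, 12, 16)}
```

The dictionary maps each wordlength to its measured ratio; longer words give markedly higher values.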

quantum computer (so far nonexistent) computer exploiting the very high 
degree of parallelism implicit in quantum mechanical systems and there- 
fore promising extremely high computing speeds (e.g. for factoring large 
composite numbers in cryptography). 

quefrency independent variable of the cepstrum. If the signal is a function 
of time (such as speech), then quefrency also has the dimension of time. 

radian frequency angular frequency. 

random access memory (RAM) fast storage device used by computers 
during calculation. A top laptop typically has 512 Megabytes of memory 
at the time of writing (2004). 

read-only memory a random access memory whose content is fixed during 
manufacture and cannot be changed subsequently. 

RealAudio radio programs distributed over the Internet. Made possible by 
speech and music compression. 

real-time a computer processing mode in which incoming data is processed 
instantaneously, without interrupting the data stream. (In television
real-time is referred to as “live.”)

recurrent backpropagation neural network a recurrent spatiotemporal 
neural network that feeds back the delayed outputs of all the top-level
neurons to all the neurons in the lowest level. At each time step, an 
external input vector is supplied to some of the lowest-level neurons and 
an output vector is received from some of the highest-level neurons. 
recursive transition network a finite state automaton that can refer to
other finite state automata and form larger structures in this manner, see 
p. 73. 

reference resolution the interpretation of anaphoric relations. For exam- 
ple, in the anaphora “I want to go to the beach. It’s always nice there.”, 
linking “there” to “the beach.” 

regular language a formal language that can be described by a finite state 
automaton, see p. 72.

relative bandwidth bandwidth divided by resonance frequency. The relative
bandwidth equals the reciprocal of Q.

residual prediction residual.

residue pitch pitch percept engendered by the higher harmonics of a peri- 
odic or nearly periodic sound. 

resonance concentration of energy in the spectrum (of a sound).

rewrite rules syntax/grammar rules that describe which phrase symbols
can be derived from which higher-level symbol. For example, the rule
S → NP VP describes that a sentence can consist of a noun phrase and a
verb phrase; NP → ART ADJ N states that a noun phrase can be made
up of an article followed by an adjective and a noun. See p. 71.
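The two rules quoted above can be put to work in a small derivation sketch. Only S → NP VP and NP → ART ADJ N come from the entry; the VP rule and the tiny word lists are hypothetical additions made so that a complete sentence can be derived.

```python
import random

# Rewrite rules: each symbol maps to a list of possible right-hand sides.
RULES = {
    "S": [["NP", "VP"]],            # from the glossary entry
    "NP": [["ART", "ADJ", "N"]],    # from the glossary entry
    "VP": [["V", "NP"]],            # assumption, added for illustration
    "ART": [["the"], ["a"]],        # hypothetical lexicon
    "ADJ": [["small"], ["loud"]],
    "N": [["dog"], ["speaker"]],
    "V": [["hears"], ["sees"]],
}

def derive(symbol, rng):
    """Expand a symbol by applying rewrite rules until only words remain."""
    if symbol not in RULES:
        return [symbol]             # a terminal symbol, i.e. a word
    expansion = rng.choice(RULES[symbol])
    words = []
    for part in expansion:
        words.extend(derive(part, rng))
    return words

sentence = derive("S", random.Random(0))
```

Every derivation from S under these rules yields seven words: article, adjective, noun, verb, article, adjective, noun.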

ROM read-only memory. 

root-mean-square square root of the average of the squared signal ampli- 
tude. 

rough in phonetics: uttered with aspiration, aspirated. In music: a dissonant 
sound. 

RSA R.L. Rivest, A. Shamir, and L.A. Adleman, inventors of public key 
cryptosystems. 

sample instantaneous value of a signal.

sampling noise see quantizing noise.

sampling rate rate of data samples. For 4 kHz-bandwidth speech, for ex- 
ample, the sampling rate is typically somewhat above 8 kHz. See also 
Nyquist rate. 

sampling theorem also called Nyquist theorem, states the fact that the 
sampling rate (Nyquist rate) needs to be higher than twice the highest
frequency component of the sound to be discretized in order to prevent 
aliasing. 
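What happens when the condition is violated can be shown with a short sketch. The 8 kHz sampling rate echoes the telephone-speech example in the sampling rate entry; the 5 kHz and 3 kHz test tones and the helper names are assumptions chosen for illustration.

```python
from math import cos, pi

def sample_cosine(freq_hz, rate_hz, count):
    """Return `count` samples of a unit cosine of freq_hz taken at rate_hz."""
    return [cos(2 * pi * freq_hz * k / rate_hz) for k in range(count)]

def alias_frequency(freq_hz, rate_hz):
    """Apparent frequency of a sampled tone after folding at half the rate."""
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)

rate = 8000                               # samples per second
tone_5k = sample_cosine(5000, rate, 16)   # above half the sampling rate
tone_3k = sample_cosine(3000, rate, 16)   # below half the sampling rate

# After sampling, the two tones can no longer be told apart.
largest_difference = max(abs(a - b) for a, b in zip(tone_5k, tone_3k))
```

The 5 kHz tone yields the same samples as a 3 kHz tone, so after sampling it masquerades as the lower frequency.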

scanner photoelectric device for scanning and digitizing a picture or text
(e.g. for further processing by a computer).
or
a program that attempts to learn about the weaknesses of a victim computer
by repeatedly probing it with requests for information.

segmentation cutting up of words into syllables and phonemes.

self-steering array array of microphones (hydrophones or loudspeakers)
that homes in on a target automatically.
semantic hierarchy in natural language processing, a classification of
semantic items into a tree structure. Child entries are specialized members
of the parent class, inherit its properties and add new ones of their own,
see p. 77. See also semantic network.

semantic network a classification of semantic items. As opposed to a
semantic hierarchy which only allows an inheritance-relation tree structure,
all classes can be connected; the dependencies form a graph. See p. 77.

semantics the rules that specify the meaning of words and sentences.

semi-vocoder vocoder in which only part of the spectrum is coded. The
voice-excited vocoder is a semi-vocoder.

sensor microphone, hydrophone or other device sensing physical data.

serial bus computer bus that transmits data bits one after another (serial)
as opposed to a parallel bus.

server a data processing unit linked to a large computer. Serves as a large 
data buffer and distributor. 

shareware software that can be downloaded from the Internet. 
shell a software layer that provides the interface between a user and the 
operating system of a computer. 

short-time spectrum Fourier transform based on a short time segment of
a signal (such as speech).

sibilant fricative.

signal-to-noise ratio (SNR) ratio of signal power to noise power on a
communication channel. Together with the bandwidth of the channel, the
signal-to-noise ratio determines its information carrying capacity.

simulation computer simulation.

sine wave a signal having a sinusoidal dependence of its amplitude as a 
function of time, such as a pure tone. 

sleep mode permits the reduction of the amount of power consumed by a 
computer while it is not in use. 

slot in natural language processing: part of a frame, holds one semantic item,
i.e. an attribute value. See p. 96.

sniffer a program that records computer and network activity.

SNR signal-to-noise ratio. 

soft palate the posterior soft portion of the palate that separates the oral
cavity from the nasal cavity.

software computer program.

sone unit of subjective loudness. By definition, 1 sone is the loudness of a 
binaural 1-kHz tone at a sound pressure of 40 dB above the threshold of 
hearing. A sound that is perceived as twice as loud has a loudness of 2 
sones. 

sonogram spectrogram of sound signal. 

soundcard computer component generating and processing sounds. Sound- 
cards contain A/D and D/A converters and use FM or wavetable syn- 
thesis. 
spam (e-mail) junk e-mail.

spatiotemporal neural network a neural network that can deal with in- 
puts and outputs that are explicit functions of time, such as in real-time 
speech processing. To obtain these dynamic properties, the network must 
be given memory, either as time delay or feedback (recurrent network).

speakerphone telephone with loudspeaker.

spectrogram two-dimensional graphic representation of spectral energy dis- 
tribution over time and frequency. 

spectrum magnitude of Fourier transform, also amplitude spectrum.

spectrum (or spectral) flattener device that flattens the spectrum of a
signal, thereby suppressing any resonance (formant) structure.

speech act the act performed when making an utterance, understood to
comprise a perlocutionary act responsible for the production of an effect 
in the receiver of an utterance, and an illocutionary act that describes 
whether the utterance is an assertion, a yes/no question, a suggestion, a 
command, etc. See p. 94. 

spreadsheet software for analyzing and modeling financial and other nu- 
merical data. 

spread spectrum method of communication in which the signal is spread
over a wide spectrum to combat multipath and other interference.

square wave a signal having only two distinct amplitude values.

statistical language processing as opposed to symbolic language processing,
the statistical approach aims to learn the ability to analyze and to
generate language by obtaining probabilities from large text corpora. See
p. 77.

stemmer in natural language processing, an algorithm that reduces an
observed word form to its stem, e.g. does → do, mice → mouse. See p. 74.

stop consonant a consonant in which the air flow is completely blocked (p, 
t, k; b, d, g). 

syllable uninterrupted segment of speech comprising a “center” of relatively
great sonority. Examples of one-syllable English words are man, wolf,
sheep etc. Human, mankind, kindness etc. have two syllables.

symbolic language processing as opposed to statistical language processing,
the symbolic approach aims to understand language through grammatical
rules rather than probabilities. See p. 73.

syntax rules for the formation of grammatical sentences in a language. (Not
a misspelled tax.)

TAG see tree-adjoining grammar.

TCP transmission control protocol, the set of communications conventions 
that enable the sending and receiving of data over the Internet. 

TeX read: [tek] or [tex] (x as in Scottish loch). A powerful programming
language for text formatting; popular because of its ability to produce
book quality text, especially for scientific and technical works. As opposed
to a word processor with which text can be entered, formatted, displayed
and printed, TeX only assumes the role of a formatter/typesetter.

text-to-speech the process or application that turns the text representation 
of an utterance into a synthesized speech signal, see p. 103. 

telemedicine delivery of healthcare, especially medical diagnosis, over the 
Internet. 

Telnet network for exchanging data (excluding graphics) between comput- 
ers. Introduced in the 1970s, it is still popular because of its speed. 

tense relative time of occurrence of the event described by the sentence, the 
moment at which the speaker utters the sentence, and, often, some third 
reference point. 

timbre sound quality (as opposed to pitch) such as the different sound
qualities of different musical instruments or human vowel sounds.

time-delay neural network (TDNN) a spatiotemporal neural network
whose hidden and output units receive not only the present input value 
but also one or more of the previous ones. Originally devised to capture 
the concept of time symmetry as encountered in phoneme recognition 
from a spectrogram. 

time warping a popular algorithm for small speech recognition applications,
see p. 90. In this pattern recognition method, a signal s is matched
against all templates t1, ..., tn in a system’s repository; s is stretched
and compressed along the time axis to account for possible vowel duration
variations.
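The matching step can be sketched with the standard dynamic-programming formulation of time warping. The one-dimensional number sequences below stand in for real speech features, and the template names are hypothetical; an actual recognizer would compare sequences of spectral feature vectors.

```python
def dtw_distance(s, t):
    """Smallest accumulated distance between s and t under time warping."""
    inf = float("inf")
    rows, cols = len(s), len(t)
    # cost[i][j]: best accumulated distance aligning s[:i] with t[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(s[i - 1] - t[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # s advances alone (stretches t)
                cost[i][j - 1],      # t advances alone (stretches s)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[rows][cols]

def recognize(signal, templates):
    """Return the name of the template with the smallest warped distance."""
    return min(templates, key=lambda name: dtw_distance(signal, templates[name]))

# Hypothetical repository with two templates.
templates = {"rise": [0, 1, 2, 3], "fall": [3, 2, 1, 0]}
signal = [0, 0, 1, 1, 2, 3, 3]   # a slower version of "rise"
best_match = recognize(signal, templates)
```

Because the signal is only a time-stretched copy of the "rise" template, its warped distance to that template is zero and it is selected as the match.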

token in natural language processing, a group of letters that qualifies as a 
word, see p. 78. See also word type. 

tonotopic mapping adjacent frequencies and modulation frequencies are 
represented in the cortex (“brain”) by adjacent areas. 

transfer function the (complex) ratio of output voltage or pressure of a 
linear system (such as an electrical filter or the vocal tract) to the input
quantity. 

transition network see finite state automaton. 

tree-adjoining grammar a grammar that constructs a sentence’s syntax 
tree by adding branches from a list of known subtrees to a small initial 
structure. See p. 92. 

treebank in natural language processing, a database of syntax trees, used 
to train stochastic grammars such as probabilistic context-free grammars 
or parsers. 

trigram see n-gram. 

triphone a window of three phones. Important to capture co-articulation 
between these phones for speech recognition and synthesis. 

TTS see text-to-speech. 

tuning curve the firing rate of an acoustic neuron as a function of frequency. 
Turing machine a simple computer consisting of an infinite strip of paper,
and a processor that moves along the paper and prints or erases sym- 
bols on it in a sequence that depends on which symbol the processor is 
currently reading and which of several states it is in. 

Turing test a criterion for machine intelligence; Turing: If a machine’s re- 
sponses in the ‘imitation game’ setup (see p. 87) were indistinguishable 
from those of a human, it is safe to deem the machine thinking. 

uncertainty Heisenberg uncertainty. 

undersampling taking samples at a sampling rate that is too small to cover
a given bandwidth; causes aliasing and/or sampling noise.

unification in natural language processing, a syntactic/semantic analysis 
that groups (i.e. unifies) words and phrases that have matching feature 
structures. 

universal resource locator Internet address, usually starting with 

Unix a powerful operating system especially suited for servers, the large
computers that power networks or databases. Invented at Bell Laboratories
in 1969, Unix is now a splintered family of operating systems that
includes IBM’s AIX, Sun Microsystems’ Solaris, and the public domain 
platform Linux. 

unvoiced voiceless speech sound. 

upload to place a file on another computer system via a modem. 

upward spread of masking in hearing: the fact that the masking or partial
masking, by a given masking sound, of a higher-frequency sound is
more pronounced than that of a lower-frequency sound. This frequency
asymmetry of masking stems from the fact that low-frequency waves
pass the region of high-frequency detection on the basilar membrane in 
the inner ear, whereas high-frequency waves hardly reach the regions of 
low-frequency detection. 

URL universal resource locator. 

USB universal serial bus. 

utilities software that performs maintenance, diagnostics or repairs on com- 
puter hardware or software. 

vaporware computer jargon: a product, especially software, that is pro- 
moted or marketed while it is still in development and that may never 
be produced. 

vector quantizing simultaneous quantizing of several signal samples such 
as successive speech samples. 

velum the soft palate. 

Verbmobil a major automatic translation project supported by the German 
Federal Ministry of Education and Research (1992-2000). 

VEV voice-excited vocoder.
virtual memory way of extending the main memory by allowing the programmer
to access slower backing storage (normally the hard disk) in the
same way as immediate access store (RAM chips).

virtual reality realistic simulation of an environment, including three-dim- 
ensional graphics and sound. 

virus a set of software instructions that damage or erase information, work 
files, or programs on a computer. 

visible speech spectrogram of speech signal. 

vocal cords the elastic bands near the “Adam’s apple” of a human that 
vibrate during voiced (not whispered) speech. 

vocal tract the “cavity” in the human head between the vocal cords and 
the lips. Its resonances determine the acoustic quality and phonetic value 
of a speech sound. 

vocoder (also channel vocoder) from voice coder. Electronic device that an- 
alyzes a speech signal in terms of its amplitude spectrum, separating the 
spectral envelope from its spectral fine structure (pitch) and synthesizing 
an artificial speech signal from the pitch information and the spectral 
envelope. The latter information is carried by typically 6 to 16 frequency 
channel signals. The total bandwidth for transmitting this spectral infor- 
mation is roughly 1/10 of the bandwidth of the speech signal itself. 

voiced speech sound produced with vibrations of the vocal cords — as 
opposed to voiceless speech sound. 

voice-excited vocoder (VEV) vocoder in which the excitation signal is 
obtained from the low-frequency components (the baseband) of a speech 
signal. 

voiceless speech sound produced by air friction without vibration of the 
vocal cords — as opposed to voiced speech sound. 

voiceprint graphic representation of a person’s voice showing energy as a 
function of time and frequency. 

volume focussing matched filters applied to an array of sensors (hy- 
drophones in the ocean, for example). In multipath transmission media, 
this results in an array focussed on a limited volume in three-dimensional 
space rather than just a directed beam. 

Voronoi cell region in multi-dimensional (signal) space where each point 
inside a given cell is closer to its quantized value than to any other quan- 
tized value. 

vowel speech sound, such as ah, eh, ee, oh, oo, produced without obstructing
or diverting the flow of air from the lungs — as opposed to consonant. 

WAN wide area network. 

war dialer a program that will automatically dial a range of telephone num- 
bers. 

waveform the shape of a signal. A speech waveform (in air) is the sound 
pressure as the function of time. 







wavelet literally: little wave, a waveform used in signal analysis and synthe- 
sis. Compactly supported wavelets have limited extent. Scaling wavelets 
are derived by scaling and shifting the independent variable of a “mother 
wavelet.” 

wavetable synthesis a process where sound samples (often of real instru- 
ments) are digitally stored and then manipulated for playback. 

.wav file waveform file, a computer file for storing sound. 

web World Wide Web. 

web browser a program that enables users to navigate the World Wide
Web, interact with other programs and users on the Internet, and call up
and display multimedia files.

web page a quantity of information on the web that has one URL and can 
be watched in one frame of a web browser. 

web site a source of information on the web consisting of one or an ensemble 
of web pages usually dealing with one specific topic. 

Western Electric former manufacturing arm of AT&T. 

wide area network an extensive computer network connecting machines 
over a longer distance than in a LAN. 

window function of time through which only a portion of a running signal 
is seen. 

Windows an integrated family of operating systems developed by Microsoft
to bring a common look and feel to computers spanning a wide range of
capabilities. They include Windows 95 and Windows 98, which generally
run on the Intel Corporation’s microprocessors; Windows NT, which is
generally found on more powerful Unix-class machines; and Windows CE,
for small electronic devices.

Windows 95/98 operates personal computers. 

Windows CE operates handheld computers, consumer electronics devices. 

Windows NT operates workstations and large servers for networking. 

Wine Windows emulator: Software designed to imitate the Microsoft Win- 
dows operating system. 

wintel a term for the combination of Windows operating systems and Intel 
microprocessors found on more than 80 percent of all computers sold 
today. 

word category see part of speech. 

wordlength number of bits per sample value. The wordlength determines 
the precision of a value. 

word processor software for creating text documents. 

word spotting automatic recognition of selected words (such as “wheat”) 
usually in a large amount of data such as obtained from the tapping of 
thousands of telephone lines. 

word type a specific word. (Cf. also token: a group of letters that qualifies 
as a word.) Example: the string “funky, funky, funky” contains one word 
type, “funky,” but three tokens, “A(funky), B(funky), C(funky).” 
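The distinction can be checked on the same example string. The tokenizer below is a deliberately crude assumption (lowercase, then runs of letters and apostrophes); what counts as a token in practice depends on the language and the application.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens: runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

text = "funky, funky, funky"
tokens = tokenize(text)        # every occurrence counts
word_types = set(tokens)       # each distinct word counts once

token_count = len(tokens)
type_count = len(word_types)
```

The string yields three tokens but a single word type.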
workstation powerful microcomputer used in computer-aided design,
electronic publishing, or other graphics intensive processing.

World Wide Web a vast, disparate network of pages of data and programs 
on the Internet, connected to one another via hyperlinks — a technology 
that lets users jump from one item to another by clicking with a mouse 
on a word or icon that points to some other part of the network. The 
web is the platform for most electronic commerce and publishing on the 
Internet. 

WWW World Wide Web (also, sarcastically, World Wide Wait).

XML extensible markup language. A markup language that allows a developer
to define his own tags and specify the types of their contents.

XOR the Exclusive Or or XOR function. 

year 2000 problem (Y2K) the book-keeping problem resulting from the 
fact that most computer programs did not envisage intelligent life after 
the year 1999. As a result, the year 2000 is interpreted as the year 1900 
with potentially disastrous consequences in commerce, banking, health 
care, and almost every other kind of human activity. (Of course the use 
of just two digits to designate a year in a given century far antedates 
computers.) 

Y2K the year 2000 problem. 

zero an antiresonance in a signal or transfer function. 

Zip drive a drive that reads and writes zip diskettes. A single zip diskette 
can store up to 100 megabytes of data. 

Zipf’s law states that a word type’s number of occurrences is inversely
proportional to its rank in the list of word types ordered by their
frequency, see p. 78.
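A rank-frequency table makes the statement concrete: if occurrences are inversely proportional to rank, then rank times frequency is constant. The corpus below is synthetic and was constructed to obey the law exactly; counts from real text follow it only approximately. The function name and the word list are illustrative assumptions.

```python
from collections import Counter

def rank_frequency(words):
    """List (rank, word, frequency) tuples, most frequent word type first."""
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Synthetic corpus: the word of rank r occurs 120 // r times.
corpus = []
for rank, word in enumerate(["the", "of", "and", "to"], start=1):
    corpus.extend([word] * (120 // rank))

table = rank_frequency(corpus)
products = [rank * freq for rank, _, freq in table]
```

For this corpus the frequencies are 120, 60, 40 and 30, so every rank-frequency product equals 120.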